Abstract

We consider a problem of atomic broadcast in a dynamic setting where processes may join, 
leave voluntarily, or fail (by stopping) during the course of computation. We provide a formal 
definition of the Dynamic Atomic Broadcast problem and present and analyze a new algorithm 
for its solution in a synchronous system, where processes have approximately synchronized 
clocks. 

Our algorithm exhibits constant message delivery latency in the absence of failures, even 
during periods when participants join or leave. To the best of our knowledge, this is the first 
algorithm for totally ordered multicast in a dynamic setting to achieve constant latency bounds 
in the presence of joins and leaves. When failures occur, the latency bound is linear in the 
number of actual failures. 

Our algorithm uses a solution to a variation on the standard distributed consensus problem, 
in which participants do not know a priori who the other participants are. We define the 
new problem, which we call Consensus with Unknown Participants, and give an early-deciding 
algorithm to solve it. 



1 Introduction 

We consider a problem of atomic broadcast in a dynamic setting where an unbounded number of 
participants may join, leave voluntarily, or fail (by stopping) during the course of computation. 
We formally define the Dynamic Atomic Broadcast (DAB) problem, which is an extension of the 
Atomic Broadcast problem [17] to a setting with infinitely many processes, any finite subset of 
which can participate at a given time. Just as Atomic Broadcast is a basic building block for state 
machine replication in a static setting [20, 27], DAB can serve as a building block for state machine 
replication among a dynamic set of processes. 

We present and analyze a new algorithm, which we call Atom, for solving the DAB problem in 
a synchronous crash failure model. Specifically, we assume that the processes solving DAB have 
access to approximately-synchronized local clocks and to a lower-level dynamic network that guarantees
timely message delivery between currently active processes. The challenge is to guarantee
consistency among the sequences of messages delivered to different participants, while still achieving 
timely delivery, even in the presence of joins and leaves. 

Atom exhibits constant message delivery latency in the absence of failures, even during periods 
when participants join or leave; this is in contrast to previous algorithms solving similar problems 
in the context of view-oriented group communication, e.g., [1, 9]. When failures occur, Atom's 
latency bound is linear in the number of failures that actually occur; it does not depend on the 
number of potential failures, nor on the number of joins and leaves that occur. 

A key difficulty for an algorithm solving DAB is that when a process fails, the network does 
not guarantee that the surviving processes all receive the same messages from the failed process. 
But the strong consistency requirements of DAB dictate that processes agree on which messages 
they deliver to their clients. The processes carry out a protocol to coordinate message delivery, 
which works roughly as follows: Each Atom process divides time into slots, using its local clock, 
and assigns each message sent by its client to a slot. Each process delivers messages to its client 
in order of slots, and within each slot, in order of sender identifiers. Each process determines the 
membership of each slot, and delivers messages only from senders that it considers to be members 
of the slot. To ensure consistency, the processes must agree on the membership of each slot. 
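
The delivery rule just described can be illustrated with a small sketch. This is not Atom's code: the `Message` record, the `members` map from a slot to its agreed membership, and the name `delivery_order` are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    slot: int      # slot assigned using the sender's local clock
    sender: int    # sender identifier (endpoints are totally ordered)
    payload: str


def delivery_order(messages, members):
    """Return messages in (slot, sender) order, keeping only those whose
    sender is considered a member of the message's slot.

    `members` maps a slot number to the set of senders agreed to be
    members of that slot (a hypothetical representation)."""
    eligible = [m for m in messages if m.sender in members.get(m.slot, set())]
    return sorted(eligible, key=lambda m: (m.slot, m.sender))
```

Because every process sorts by the same key, two processes that agree on the membership of each slot produce the same delivery sequence, which is why agreement on membership is the crux.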

Processes joining (or voluntarily leaving) the service coordinate their own join (or leave) by 
selecting a join-slot (or leave-slot) and informing the other processes of this choice, without delaying 
the normal delivery of messages. When a process fails, Atom uses a novel distributed consensus 
service to agree upon the slot in which it fails. The consensus service required by Atom differs from 
the standard stopping-failure consensus services studied in the distributed algorithms literature 
(see, e.g., [21]) in that the processes implementing the consensus service do not know a priori who 
the other participants are. Atom tracks process joins and leaves, and uses this information to 
approximate the active set of processes that should participate in consensus. However, different 
processes running Atom may have somewhat different perceptions of the active set, e.g., when a 
participant joins or leaves Atom at roughly the time consensus is initiated. 

In order to address such uncertainties, we define a new consensus service, consensus with unknown
participants (CUP). When a process i initiates CUP, it submits to CUP a finite set W_i
estimating the current world, in addition to i's proposed initial consensus value v_i. The worlds
suggested by different participants do not have to be identical, but some restrictions are imposed
on their consistency. Consider, e.g., the case that process k joins Atom at roughly the time CUP
is initiated. One initiator, i, may think that k has joined in time to participate and include k in
W_i, while another, j, may exclude k from W_j. Process k cannot participate in the CUP algorithm
in the usual way, because j would not take its value into account. On the other hand, if k does not
participate at all, i could block, waiting forever for a message from k. We address such situations
by allowing k to explicitly abstain from an instance of CUP, i.e., to participate without providing
an input. A service that uses CUP must ensure that for every i, (1) W_i includes all the processes
that ever initiate this instance of CUP (unless they fail or leave prior to i's initiation); and (2) if
j ∈ W_i (and neither i nor j fails or leaves), then j participates in CUP either by initiating or by
abstaining. Thus, the W_i sets can differ only in the inclusion of processes that abstain, leave, or fail.

Note that once an instance of CUP has been started, no new processes (that are not included
in W_i) can join the running instance. Nevertheless, CUP provides a good abstraction for solving
DAB, because Atom can invoke multiple instances of CUP with different sets of participants. 

We give an early-deciding algorithm to solve CUP in a fail-stop model [26], that is, in a time-free
crash failure model where processes are equipped with perfect failure detectors [5]. The failure
detector is external to CUP; it is implemented by Atom. CUP uses a strategy similar to previous 
early-deciding algorithms for consensus with a predetermined set of participants [13], but it also 
tolerates uncertainty about the set of participants, and moreover, it allows processes to leave 
voluntarily without incurring additional delays. The time required to reach consensus is linear in 
the number of failures that actually occur during an execution, and does not depend on an upper 
bound on the number of potential failures, nor on the number of processes that leave. 

We also analyze the message-delivery latency of Atom under different failure assumptions. We 
show a constant latency bound for periods when no failures occur, even if joins and leaves occur. 
When failures occur, the latency is proportional to the number of actual failures. This is inevitable: 
atomic broadcast requires a number of rounds that is linear in the number of failures (see [2]).

We envision a service using Atom, or a variation of it, deployed in a large LAN, where latency 
is predictable and message loss is bounded. In such settings, a network with the properties we 
assume can be implemented using forward error correction (see [3]), or retransmissions (see [28]). 
The algorithm can be extended for use in environments with looser time guarantees, e.g., networks 
with differentiated services; we outline ideas for such an extension in Section 7.7. 

In summary, this paper makes the following main contributions: (1) the definitions of two new 
problems for dynamic networks, expressed by the DAB and CUP services; (2) an early-delivery DAB 
algorithm, Atom, which exhibits constant latency in the absence of failures; (3) a new early-deciding
algorithm for solving CUP in a fail-stop model; and (4) the analysis of Atom's message-delivery 
latency under various failure assumptions. 

The rest of this paper is organized as follows: Section 2 discusses related work. In Section 3, 
we specify the DAB service. In Section 4 we specify CUP and in Section 5, we present the CUP 
algorithm and its analysis. We then turn to the presentation of Atom: Section 6 specifies the 
environment and model assumptions for Atom, and Section 7 contains a detailed presentation of 
the Atom algorithm and its analysis. Section 8 concludes the paper. The Appendix contains 
rigorous correctness proofs for both CUP and Atom. 

2 Related Work 

A dynamic universe, where processes join and leave, was first considered in the context of view-oriented
group communication work [7], pioneered by the Isis [4] system. The first analysis of
time bounds of message delivery in synchronous group communication systems was performed by 
Cristian [9]. Our service resembles the services provided by group communication systems; although 
we do not export membership to the application, it is computed, and would be easy to export. 

View-oriented group communication systems, including systems designed for synchronous systems
and real-time applications (e.g., Cristian's [9], xAMp [25], and RTCAST [1]), generally run
a group membership protocol every time a process joins or leaves, and therefore delay message
delivery to all processes when joins or leaves occur. Cristian's system uses an atomic broadcast 
primitive to agree upon group membership. Since, unlike CUP, the atomic broadcast service works 
with a static universe, a process join has to be agreed upon before any new membership change 
is handled (voluntary leaves are not considered). Therefore, Cristian's service exhibits constant 
latency only in periods in which no joins or failures occur. Latency during periods with multiple 
joins is not analyzed. xAMp is a group communication system supporting a variety of communication
primitives for real-time applications. The presentation of xAMp in [25] focuses on the various
communication primitives and assumes that a membership service is given. The delays due to failures
and joins are incurred in the membership part, which is not described or analyzed. RTCAST
is a real-time group communication system, for which a detailed analysis of membership latency 
was conducted [1]. The latency bound achieved by RTCAST is linear in the number of processes, 
even when no process fails, due to the use of a logical ring. Moreover, RTCAST makes stronger 
assumptions about its underlying network than we do - it uses an underlying reliable broadcast 
service that guarantees that correct processes deliver the same messages from faulty ones; the cost 
of this primitive is not considered in the analysis. 

Some group membership services avoid running the full-scale membership algorithm for joins and
leaves by using light-weight group membership services [15]; they use an atomic broadcast service to
disseminate join and leave messages in a consistent manner, without running the full-scale group
membership algorithm. However, unlike our CUP service, the atomic broadcast services such systems
use do not tolerate uncertainty about the set of participants. Therefore, a race condition between
a join and a concurrent failure can cause such light-weight group services (e.g., [23, 12, 15]) to 
violate the semantics of the underlying heavy-weight membership services. Those light-weight 
group services that do preserve the underlying heavy-weight membership semantics (e.g. [24]), do 
incur extra delivery latencies whenever joins and leaves occur. 

Other work on group membership in synchronous and real-time systems, e.g., [19, 18] has focused 
on membership maintenance in a static, fairly small, group of processes, where processes are subject 
to failures but no new processes can join the system. Likewise, work analyzing time bounds of 
synchronous atomic broadcast, e.g. [16, 10, 8], considered a static universe, where processes could 
fail but not join. Thus, this work did not consider the DAB problem. 

In a previous paper [3], we considered a simpler problem of dynamic totally ordered broadcast
without all-or-nothing semantics. For this problem, the linear lower bound does not apply, and we
exhibited an algorithm that solves the problem in constant time even in the presence of failures. 

Recent work [22, 6] considers different services, including (one shot) consensus, for infinitely 
many processes in asynchronous shared memory models. Chockler and Malkhi [6] present a consensus
algorithm for infinitely many processes using a static set of active disks, a minority of which
can fail. This differs from the model considered here, as in our model all system components may
be ephemeral. Merritt and Taubenfeld [22] study consensus under different concurrency models; in
their terminology, our model assumes unbounded concurrency and [1, ∞]-participation, which means
that at least one process must participate and there is no bound on the number of participants.
They show that with these assumptions, in an asynchronous shared memory model, infinitely many
bits are required in order to solve consensus. The algorithms they give are not fault tolerant (they
tolerate only initial failures). To the best of our knowledge, atomic broadcast has not been considered
in a similar context. Moreover, these problems were not considered in message-passing
models, and it is not clear that a canonical transformation from the shared memory model to the
message-passing model applies to a setting with infinitely many processes.



3 Dynamic Atomic Broadcast Service Specification 

We now present the DAB service specification. Our universe consists of an infinite ordered set of 
endpoints, I. The specification of DAB is parameterized by a message alphabet, M. The signature 
of the DAB(M) service is presented in Figure 1. 

Input:

    join_i, leave_i, fail_i, i ∈ I
    mcast_i(m), m ∈ M, i ∈ I

Output:

    join_OK_i, leave_OK_i, i ∈ I
    rcv_i(m), m ∈ M, i ∈ I

Figure 1: The signature of the DAB(M) service.

We do not consider recoveries from failure or rejoining after leaving. In other words, there 
cannot be multiple "incarnations" at a single endpoint. Instead of new incarnations, consider the 
same client joining at new endpoints. 

Assumptions about the application: DAB(M) assumes that its application satisfies the following
safety conditions:

• For each i ∈ I:

— At most one join_i and at most one leave_i occur.

— If leave_i occurs, then it is preceded by join_OK_i.

— Any mcast_i(m) has a preceding join_OK_i but no preceding leave_i or fail_i.

• At most one mcast(m) occurs for each particular m.

DAB guarantees: Given an application that satisfies the above constraints, DAB(M) satisfies 
the properties we now specify. 

We first specify some basic integrity properties, both safety and liveness. We later specify the 
properties related to the ordering and reliability of messages. 

Basic safety properties: 

• Join/leave integrity: For each i:

— At most one join_OK_i and at most one leave_OK_i occur.

— If join_OK_i occurs then it is preceded by join_i.

— If leave_OK_i occurs then it is preceded by leave_i.

• Message integrity:

— No two rcv_j(m) actions occur for the same m and j.

— If rcv_j(m) occurs for some j then it is preceded by mcast_i(m) for some i.



Basic liveness properties: 

• Eventual join: If join_i occurs then either fail_i or join_OK_i occurs.

• Eventual leave: If leave_i occurs then either fail_i or leave_OK_i occurs.

To specify the ordering and reliability guarantees of DAB, we require that there be a total
ordering S on all the messages received by any of the endpoints, such that for all i ∈ I, the
following properties are satisfied.

Safety properties:

• Multicast order: If mcast_i(m) occurs before mcast_i(m'), then m precedes m' in S.

• Receive order: If rcv_i(m) occurs before rcv_i(m'), then m precedes m' in S.

• Multicast gap-freedom: If mcast_i(m), mcast_i(m'), and mcast_i(m'') occur, in that order, and
S contains m and m'', then S also contains m'.

• Receive gap-freedom: If S contains m, m', and m'', in that order, and rcv_i(m) and rcv_i(m'')
occur, then rcv_i(m') also occurs.

Liveness properties:

• Multicast liveness: If mcast_i(m) occurs and no fail_i occurs, then S contains m.

• Receive liveness: If S contains m, m is sent by i, and i does not leave or fail, then rcv_i(m)
occurs, and for every m' that follows m in S, rcv_i(m') also occurs.
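
To make the receive-side safety properties concrete, the following sketch checks them for one endpoint on a finite execution. The list encodings of S and of the endpoint's rcv sequence, and the function names, are assumptions made for this illustration; they are not part of the DAB specification.

```python
def receive_order_ok(S, received):
    """Receive order: `received` (the rcv sequence at one endpoint) must
    list messages in the same relative order as the total order S.
    The strict comparison also rejects a message received twice."""
    pos = {m: k for k, m in enumerate(S)}
    idx = [pos[m] for m in received]
    return all(a < b for a, b in zip(idx, idx[1:]))


def receive_gap_free(S, received):
    """Receive gap-freedom: the received messages must occupy a contiguous
    stretch of S, so no message of S lying between two received messages
    is skipped."""
    if not received:
        return True
    pos = {m: k for k, m in enumerate(S)}
    idx = sorted(pos[m] for m in received)
    return idx == list(range(idx[0], idx[-1] + 1))
```

For example, with S = [a, b, c, d], an endpoint that joins late may receive only [b, c, d], but an endpoint that receives [a, c] violates gap-freedom.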

4 Consensus with Unknown Participants - Specification 

In this section we define the problem of Consensus with Unknown Participants (CUP). CUP is 
an adaptation of the problem of fail-stop uniform consensus to a dynamic setting in which the 
set of participants is not known ahead of time, and in which participants can leave the algorithm 
voluntarily after initiating it. Moreover, participants are not assumed to initiate at the same time. 
CUP uses an underlying reliable network, and a perfect failure detector. 

We begin with a description of CUP's external signature (interface). We then specify the 
assumptions that CUP makes about its environment, including the application, the underlying 
network, and the external failure detector. We separate these into safety and liveness assumptions. 
Finally, we specify CUP's safety and liveness guarantees. CUP's safety guarantees depend on only 
the safety assumptions, that is, they are not allowed to be violated even if the liveness assumptions 
do not hold. On the other hand, CUP's liveness guarantees depend on both the safety and liveness 
assumptions. 

4.1 External Signature 

The CUP specification uses the following data types: 

• I, an infinite ordered set of endpoints. Each endpoint in I corresponds to a potential participant
in CUP.

• V, a totally ordered set of values. Initial values and decision values are elements of V. 



Input:

    init_i(v, W), v ∈ V, W ⊆ I, W finite, i ∈ I    // i initiates with value v, world W
    abstain_i, i ∈ I                               // i abstains
    net_rcv_i(m), m ∈ M_CUP, i ∈ I                 // i receives message m
    leave_i, i ∈ I                                 // i leaves
    leave_detect_i(j), j, i ∈ I                    // i detects that j has left
    fail_i, i ∈ I                                  // i fails
    fail_detect_i(j), j, i ∈ I                     // i detects that j has failed

Output:

    decide_i(v), v ∈ V, i ∈ I                      // i decides on value v
    net_mcast_i(m), m ∈ M_CUP, i ∈ I               // i multicasts m

Figure 2: The signature of CUP.




Figure 3: Interface diagram for CUP. 

• M_CUP, a message alphabet.

The external signature of CUP is presented in Figure 2, and depicted in Figure 3. 

The interface describes four kinds of interaction: "normal" interaction with clients of the CUP 
service, interaction with a multicast network, communication involving leaves and leave detection, 
and communication involving failures and failure detection. 



Normal interaction with clients: A process may participate in the CUP service in two ways:
it may provide an initial value, in which case we say that the process initiates CUP, or it may
decline to provide an initial value, in which case we say that it abstains. Participant i ∈ I initiates
CUP using the init_i(v, W) action. Here, v is i's initial value, and W is its initial world, that is,
the set of processes that i expects to participate in CUP. Participant i abstains using the abstain_i
action. Informally speaking, a participant abstains when it does not need to participate in CUP, but
because of uncertainty about CUP participants, some other participant may expect it to participate.
An environment assumption ensures that, if any process expects i to participate in CUP, i will in
fact participate, unless it leaves or fails. CUP reports the consensus decision value to process i
using the decide_i(v) action.

Multicast network: The network interface consists of the net_mcast and net_rcv actions.

Leaves: A participant can leave the CUP service voluntarily using the leave_i action. We assume
that the environment provides a leave detector; the leave_detect_i(j) action is used to notify i
that j has left the algorithm voluntarily.

Failures: The fail_i action represents the failure of endpoint i. We assume that the environment
provides a failure detector, which uses the fail_detect_i(j) action to notify i that j has failed.

4.2 Environment Assumptions 

Here we list and explain the assumptions that CUP makes about its environment. We classify these 
as safety and liveness assumptions. Formally, each of the properties given here is a trace property 
([21, Ch. 8]). 

4.2.1 Safety assumptions 

The first assumption expresses simple well-formedness conditions saying that each participant begins
participating (by initiating or abstaining) at most once, leaves at most once, and fails at most
once.

• Well-formedness: For any i ∈ I,

1. At most one init_i or abstain_i event¹ occurs.

2. At most one leave_i event occurs.

3. At most one fail_i event occurs.

4. No leave_i or fail_i precedes an init_i.

The next assumption says that, while the worlds W suggested by different participants in their init
events do not have to be identical, CUP's environment must guarantee that they have a certain
kind of consistency. Namely, each W set submitted by an initiating participant i must include all
participants that ever initiate CUP and that do not leave or fail prior to the init_i event. This
implies that every participant must be included in its own estimated world.

• World consistency: If init_i(*, W) and init_j(*, *) events occur, then either j ∈ W, or a
leave_j or fail_j event occurs before the init_i(*, W) event.
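
As an illustration of the world consistency assumption, the sketch below checks it on a finite trace. The tuple encoding of events and the name `world_consistent` are hypothetical, chosen only for this example; events of other kinds (e.g., abstain) are simply ignored.

```python
def world_consistent(trace):
    """Check world consistency on a finite trace.

    `trace` is a list of events in a hypothetical encoding:
      ("init", i, v, W), ("leave", i), ("fail", i)
    For every init_i(*, W) and every initiator j, either j is in W or
    a leave_j / fail_j event occurs earlier in the trace."""
    initiators = {e[1] for e in trace if e[0] == "init"}
    gone = set()  # endpoints that have left or failed so far
    for e in trace:
        if e[0] in ("leave", "fail"):
            gone.add(e[1])
        elif e[0] == "init":
            W = e[3]
            if any(j not in W and j not in gone for j in initiators):
                return False
    return True
```

Note that the check quantifies over all initiators in the trace, including those that initiate later than the init event being examined, matching the "ever initiate" wording above.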

The next property describes the correctness of the message deliveries: every message that is received 
was previously sent, and no message is received at the same location more than once. Moreover, 
the order of message receipt between particular senders and receivers is FIFO. 



1 An "event" is an occurrence of an action in a sequence. 



• Message integrity: There is a mapping from net_rcv events to preceding net_mcast events,
such that the same message in M_CUP appears in both events, and such that no two net_rcv_i
events for the same i map to the same net_mcast event. Moreover, two net_rcv_i events that
map to net_mcast events of the same sender occur in the same order as the net_mcast events.

The next two properties describe assumptions about leaves and leave detection. The first says that
leave detection is "accurate", in the sense that the occurrence of a leave_detect_i(j) implies that
j has really left; it also includes a simple well-formedness condition. The second property says that
leaves are handled gracefully, in the sense that the occurrence of a leave_detect_i(j) implies that
i has already received any network messages sent by j prior to leaving. Thus, a leave_detect_i(j)
is an indication that i has not lost any messages from j.

• Accurate leave detector: For any i, j ∈ I, at most one leave_detect_i(j) event occurs, and if
leave_detect_i(j) occurs, then it is preceded by a leave_j.

• Lossless leave: Assume net_mcast_j(m) occurs and is followed by a leave_j. Then if a
leave_detect_i(j) occurs, it is preceded by net_rcv_i(m).

The final safety assumption says that failure detection is accurate. 

• Accurate failure detector: For any i, j ∈ I, at most one fail_detect_i(j) event occurs, and if
fail_detect_i(j) occurs, then it is preceded by a fail_j.

Note that we do not have a failure assumption analogous to the lossless leave property; thus, failures 
are different from leaves in that we allow the possibility that some messages from failed processes 
may be lost. 

4.2.2 Liveness assumptions 

The first liveness assumption says that, if any process i expects another process j to participate, 
then j will actually do so, unless either i or j leaves or fails. 

• Init occurrence: If an init_i(*, W) event occurs and j ∈ W, then an init_j, abstain_j, leave_j,
fail_j, leave_i, or fail_i occurs.

The next assumption describes reliability of message delivery. It says that any message that 
is multicast by a non-failing participant that belongs to any of the W sets submitted to CUP, is 
received by all the non-leaving, non-failing members of all those W sets. 

• Reliable delivery: Define U = ∪ {W | init_k(*, W) occurs for some k ∈ I}. If i, j ∈ U and
net_mcast_i(m) occurs after an init_i or abstain_i event, then a net_rcv_j(m), leave_j, fail_j,
or fail_i occurs.

The final liveness assumption says that the leaving or failure of any process that belongs to an 
initiator's W set is detected by that initiator, unless it finishes by deciding, leaving, or failing. 

• Complete leave and failure detector: If init_i(*, W) occurs, j ∈ W, and leave_j or fail_j
occurs, then fail_detect_i(j), leave_detect_i(j), decide_i, leave_i, or fail_i occurs.



4.3 CUP Service Guarantees 

Now we list CUP's service guarantees. Again, we classify these as safety and liveness properties. As 
we noted earlier, CUP's safety guarantees depend only on its safety assumptions, whereas CUP's 
liveness guarantees depend on both its safety and liveness assumptions. 

Formally, each individual property is a trace property. The complete specification consists of 
two general trace properties whose respective sets of traces are defined by the following predicates: 

1. The conjunction of all the CUP safety assumptions implies all the CUP safety guarantees. 

2. The conjunction of all the CUP safety and liveness assumptions implies all the CUP liveness 
guarantees. 

4.3.1 Safety guarantees 

The first guarantee expresses well-formedness conditions saying that only participants that have 
initiated can decide, and each participant decides at most once. 

• Well-formedness: For any i ∈ I,

1. If decide_i occurs then it is preceded by an init_i.

2. At most one decide_i occurs.

The next two guarantees are the main agreement and validity guarantees for consensus. The 
uniform agreement property says that everyone who decides agrees. The validity property has two 
parts: it says that any decision value is some participant's initial value, and moreover, that any 
participant's decision is no greater than its initial value. The latter is not a "standard" property 
for consensus but is needed for our use in Atom. 

• Uniform Agreement: For any i, j ∈ I, if decide_i(v) and decide_j(v') both occur then v = v'.

• Validity: For any i ∈ I, if decide_i(v) occurs then

1. For some j, init_j(v, *) occurs.

2. If init_i(v', *) occurs then v ≤ v'.
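
The agreement and validity guarantees can be checked mechanically on the outcome of a finite execution. The sketch below assumes a simple dictionary encoding of the init and decide events; the encoding and the name `check_decisions` are illustrative assumptions, not part of the CUP interface.

```python
def check_decisions(inits, decisions):
    """inits: dict endpoint -> initial value (from init_i(v, *)).
    decisions: dict endpoint -> decided value (from decide_i(v)).
    Returns True iff uniform agreement and both validity clauses hold."""
    if len(set(decisions.values())) > 1:
        return False                      # uniform agreement violated
    for i, v in decisions.items():
        if v not in inits.values():
            return False                  # validity 1: v is some initial value
        if i in inits and v > inits[i]:
            return False                  # validity 2: v is no greater than i's own value
    return True
```

Deciding on the minimum initial value among the participants, as the algorithm of Section 5 does, satisfies both validity clauses at once.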

4.3.2 Liveness guarantees 

CUP provides one liveness guarantee, which says that any participant that initiates and neither 
leaves nor fails must eventually decide. We do not make such a guarantee for a participant that 
abstains, that is, participants that abstain need not be informed of the decision value. 

• Termination: If an init_i event occurs then a decide_i, leave_i, or fail_i occurs.

5 The CUP Algorithm 

In this section, we present our implementation of CUP. 



5.1 Modeling Assumptions and Conventions 

We use the I/O automaton model of Lynch and Tuttle (see, e.g., [21, Ch. 8]), using standard
precondition/effect (guarded command) pseudo-code, augmented with one new construct: effects
may include statements of the form trigger(a), where a is an output action. Formally, we assume
the automaton's state contains a special FIFO buffer trigger-buffer. The trigger(a) statement
adds a to the end of trigger-buffer. The action at the head of trigger-buffer is always
enabled, and gets removed from trigger-buffer when it is performed. No other state changes are
associated with action a.
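
A minimal sketch of the trigger construct follows, assuming a Python rendering in which output actions are opaque values. The class name `TriggerBuffer` and its method names are hypothetical.

```python
from collections import deque


class TriggerBuffer:
    """FIFO buffer of pending output actions, mirroring trigger-buffer."""

    def __init__(self):
        self._queue = deque()

    def trigger(self, action):
        # trigger(a): add a to the end of the buffer.
        self._queue.append(action)

    def enabled(self):
        # Only the action at the head of the buffer is enabled.
        return self._queue[0] if self._queue else None

    def perform(self):
        # Performing the head action removes it; no other state changes.
        return self._queue.popleft()
```

The FIFO discipline means that triggered outputs are performed in the order in which the effects that produced them were executed.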

The fail_i action described in the CUP interface represents the failure of endpoint i. In terms
of the algorithm, we interpret this to mean that once fail_i occurs, i performs no more locally
controlled actions, and input actions have no effect on the state. We treat this as a general
convention, and do not include event handlers for fail_i actions in our pseudo-code.

5.2 The Algorithm 

Figures 4 and 5 contain the CUP implementation for a particular endpoint i ∈ I. The algorithm
includes no internal actions. Therefore, the signature consists of the actions indexed by this particular
i in the external signature of CUP (see Section 4). The message alphabet M_CUP is specialized
to the set of messages of the following forms:

• (i, r, v, W), where i ∈ I, r ∈ N, v ∈ V, and W is a finite subset of I.

• (i, OUT, r), where i ∈ I and r ∈ N.

The algorithm proceeds in asynchronous rounds numbered 1, 2, .... In each round, a process
sends its current estimates of the value and the world (the set of active processes) to the other
processes. Each process maintains two-dimensional arrays, value and world, in which it collects
the value and world information it receives from all processes in all rounds. It records, in a
variable out[r], the other processes that it knows will not participate in round r because they
have previously left, abstained, or decided. It also records, in a variable failed, the processes that
it knows have failed.

mode ∈ {⊥, running, done}, initially ⊥
round ∈ N, initially 0
for each r ∈ N+, j ∈ I:
    value[r, j] ∈ V ∪ {⊥}, initially ⊥
    world[r, j], a finite subset of I or ⊥, initially ⊥
for each r ∈ N+:
    out[r], a finite subset of I, initially {}
    failed[r], a finite subset of I, initially {}

Derived variables:

for each r ∈ N+:
    out-by[r], a finite subset of I, defined as ∪_{r' ≤ r} out[r']
    failed-by[r], a finite subset of I, defined as ∪_{r' ≤ r} failed[r']

Figure 4: CUP_i state.
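
The state of Figure 4 could be rendered in Python roughly as follows. This is a sketch under stated assumptions: ⊥ is represented by None or by a missing dictionary key, the initial round is taken to be 0, and the class name `CupState` and the method names `out_by` and `failed_by` are introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CupState:
    mode: object = None          # None stands for ⊥; otherwise "running" or "done"
    round: int = 0               # assumed initial value
    value: dict = field(default_factory=dict)   # (r, j) -> value; missing key means ⊥
    world: dict = field(default_factory=dict)   # (r, j) -> frozenset; missing key means ⊥
    out: dict = field(default_factory=lambda: defaultdict(set))     # r -> endpoints
    failed: dict = field(default_factory=lambda: defaultdict(set))  # r -> endpoints

    def out_by(self, r):
        # Derived variable: union of out[r'] over all rounds r' <= r.
        return set().union(*(s for rr, s in self.out.items() if rr <= r))

    def failed_by(self, r):
        # Derived variable: union of failed[r'] over all rounds r' <= r.
        return set().union(*(s for rr, s in self.failed.items() if rr <= r))
```

Keeping out-by and failed-by as methods rather than stored fields matches their role as derived variables: they are always recomputed from out and failed.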
init_i(v, W)
    Eff: if mode = ⊥ then
             mode ← running
             round ← 1
             trigger(net_mcast_i(i, 1, v, W))

net_mcast_i(i, r, v, W) where r ≥ 2
    Pre: mode = running
         r = round + 1
         W = world[round, i] \ out[round] \ failed[round]
         // All messages for the previous round have been received.
         ∀ j ∈ W : value[round, j] ≠ ⊥
         W ≠ {} ∧ v = min{value[round, j] | j ∈ W}
         // No decision can be made.
         ¬ ∀ j ∈ world[round, i] \ out[round] :
             value[round, j] = value[round, i] ∧ world[round, j] ⊆ world[round, i]
    Eff: round ← r

net_rcv_i(j, r, v, W)
    Eff: if mode ≠ done ∧ j ∉ failed-by[r] then
             value[r, j] ← v
             world[r, j] ← W

abstain_i
    Eff: if mode = ⊥ then
             mode ← done
             trigger(net_mcast_i(i, OUT))

decide_i(v)
    Pre: mode = running
         value[round, i] ≠ ⊥
         ∀ j ∈ world[round, i] \ out[round] :
             value[round, j] = v ∧ world[round, j] ⊆ world[round, i]
    Eff: mode ← done
         trigger(net_mcast_i(i, OUT))

net_rcv_i(j, OUT)
    Eff: if mode ≠ done then
             let r = min{r' ∈ N+ | value[r', j] = ⊥}
             out[r] ← out[r] ∪ {j}

leave_i
    Eff: mode ← done

leave_detect_i(j)
    Eff: if mode ≠ done then
             let r = min{r' ∈ N+ | value[r', j] = ⊥}
             out[r] ← out[r] ∪ {j}

fail_detect_i(j)
    Eff: if mode ≠ done then
             let r = min{r' ∈ N+ | value[r', j] = ⊥}
             failed[r] ← failed[r] ∪ {j}

Figure 5: CUP_i transitions.
The code works as follows. When an init_i(v,W) input occurs, process i triggers a net_mcast_i(i,1,v,W)
to send its initial value v and estimated world W to all processes, including itself.

For each round r ≥ 2, process i performs an explicit net_mcast_i(i,r,v,W) to multicast its
round r value v and world W. The world W is determined to be the set of processes that i thinks are
still active, that is, the processes in i's previous world that i does not know to be out or to have
failed in round r. Process i may perform this multicast only if its round is r-1, it has received
round r-1 messages from all the processes in W, and it is not currently able to decide. The value v
that is sent is the minimum value that i has recorded for round r-1 from a process in W.
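
The choice of the round r message can be sketched in Python as follows. This is only an illustrative sketch, not the authors' code: the function name `next_round_message` and the dictionary-based state are assumptions made here, and the additional requirement that i is not currently able to decide is deliberately left out.

```python
from typing import Dict, Optional, Set, Tuple


def next_round_message(
    world_i: Set[str],
    out: Set[str],
    failed: Set[str],
    values: Dict[str, int],
) -> Optional[Tuple[int, Set[str]]]:
    """Return (v, W) for the next round, or None if i must keep waiting.

    world_i, out, failed and values describe the previous round as seen
    by process i (hypothetical representation).
    """
    # W: the previous world minus processes known to be out or failed.
    W = world_i - out - failed
    # Wait until a previous-round value has arrived from every process in W.
    if not W or any(j not in values for j in W):
        return None
    # v: the minimum value recorded for the previous round from a process in W.
    v = min(values[j] for j in W)
    return v, W
```

A caller would invoke this each time the state changes and multicast (v, W) once the result is not None and no decision is possible.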

When a net_rcv_i(j,r,v,W) occurs, process i puts v and W into the appropriate places in the
value and world arrays.

When an abstain_i input occurs, process i sends an OUT message, so that other processes will
know not to wait for further messages from it, and stops participating in the algorithm.

Process i can decide at a round r when it has received messages from all processes in its
world[r,i] except those that are out at round r, such that all of these messages contain the
same value and contain worlds that are subsets of world[r,i]. The subset requirement ensures
that processes in world[r,i] will not consider values from processes outside of world[r,i] in
determining their values for future rounds. When process i decides, it multicasts an OUT message
and stops participating in the algorithm.
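
The decision test can be illustrated with a short Python sketch. The name `can_decide` and the per-round dictionaries are hypothetical; the sketch takes the candidate decision value to be i's own round value, which matches the precondition whenever i is itself in its world and not out.

```python
from typing import Dict, Optional, Set


def can_decide(
    i: str,
    world: Dict[str, Set[str]],
    value: Dict[str, int],
    out: Set[str],
) -> Optional[int]:
    """Return the value i may decide in the current round, or None.

    world[j] and value[j] hold the round's world and value received from j
    (absent keys model the undefined value); out is the round's out set.
    """
    if i not in value or i not in world:
        return None
    v = value[i]
    for j in world[i] - out:
        # A message from j is still missing, so no decision is possible yet.
        if j not in value or j not in world:
            return None
        # All values must agree, and each world must be a subset of i's world.
        if value[j] != v or not world[j] <= world[i]:
            return None
    return v
```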

When a net_rcv_i(j,OUT) occurs, process i records that j is out of the algorithm starting from
the first round for which i has not yet received a regular message from j.

When leave_i occurs, process i just stops participating in the algorithm. When leave_detect_i(j)
occurs, process i records that j is out; when this occurs, the lossless leave assumption ensures that
i has already received all the messages j sent. The round that is recorded for the leave is the first
round after the round of the last message received from j.
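
The round that is recorded for an OUT message or a detected leave can be computed as in the following sketch. The helper `record_out` and its arguments are illustrative assumptions, not part of the CUP code: `rounds_from_j` stands for the set of rounds r' for which value[r',j] is defined.

```python
from typing import Dict, Set


def record_out(rounds_from_j: Set[int], out: Dict[int, Set[str]], j: str) -> int:
    """Mark j as out from the first round with no regular message from j."""
    r = 1
    # Find the smallest positive round for which no value from j is recorded.
    while r in rounds_from_j:
        r += 1
    out.setdefault(r, set()).add(j)
    return r
```

For a detected failure, the same round computation would apply with the failed sets in place of the out sets.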

Process i knows that another process has failed if it learns about the failure via a fail_detect
event.

In the next section, we prove the algorithm's correctness. In Section 5.3, we show that the 
algorithm is early-deciding in the sense that the number of rounds it executes is proportional to 
the number of actual failures that occur, and does not depend on the number of participants or on 
the number of processes that leave. 

5.3 The Early-Deciding Property 

We now show that the algorithm is early-deciding in the sense that the number of rounds it executes 
is proportional to the number of actual failures that occur, and does not depend on the number of 
participants or on the number of processes that leave. 
We start with some more lemmas. 

Lemma 5.1 If init_i(*,W) occurs prior to init_j, then j ∈ W.

Proof: The environment well-formedness assumption implies that j does not leave or fail before
it initiates, and hence does not leave or fail before i initiates. Therefore, by world consistency,
j ∈ W. ■

Invariant 5.1 If (i,1,*,W) and (j,2,*,*) are in the Net then j ∈ W.

Proof: By strong induction. For the inductive step, assume that, in the final state of the
execution, (i,1,*,W) and (j,2,*,*) are in the Net. Then both init_i and init_j events appear in
the execution. If init_i precedes init_j, then Lemma 5.1 implies the result, so assume that init_j
precedes init_i.

Since a round 2 message from j is in the Net, a round 1 message (j,1,*,V) is also. Then
Lemma 5.1 implies that i ∈ V.

We claim that j does not leave or fail before the init_i. Suppose for the sake of contradiction
that it does. Then the net_mcast(j,2,*,*) event precedes the init_i. Then environment
well-formedness implies that i does not fail or leave prior to the net_mcast(j,2,*,*) event, because it
initiates after this event. Also, i does not abstain, because it initiates. And i does not decide prior
to the net_mcast(j,2,*,*) event, because that precedes the init_i. Therefore, i ∉ failed[1]_j ∪
out[1]_j in the pre-state of the net_mcast(j,2,*,*) event, so i ∈ world[1,j]_j \ (failed[1]_j ∪
out[1]_j) in that state. The precondition of net_mcast implies that value[1,i]_j ≠ ⊥ in the
pre-state, that is, j has received a round 1 message from i before the net_mcast(j,2,*,*). But this
cannot happen, because init_i happens after the net_mcast(j,2,*,*). This contradiction implies
that j does not leave or fail before the init_i. Then world consistency implies that j ∈ W, as needed. ■



In the rest of this section, we consider a situation where no failures happen from some point 
onward in an execution, and where the rounds of all processes are at most r at the point where 
failures cease. The following lemma says that all round r+2 messages that are ever sent have the 
same world component. 

Lemma 5.2 Suppose that r > 0. Suppose that there is a point t in an execution such that every
process has round ≤ r at point t, and no fail events happen from t onward.
If net_mcast(i,r+2,v,W) and net_mcast(j,r+2,v',W') both occur in the execution, then W = W'.

Proof: We show that W ⊆ W'. The other direction is analogous.

The two sets are determined in the precondition of net_mcast, as follows:
W = world[r+1,i]_i \ out[r+1]_i \ failed[r+1]_i, where the values of the last two terms are
taken from the pre-state of net_mcast(i,r+2,v,W), and
W' = world[r+1,j]_j \ out[r+1]_j \ failed[r+1]_j, where the values of the last two terms are
taken from the pre-state of net_mcast(j,r+2,v',W'). Invariant A.7 implies that W = world[1,i]_i
\ out-by[r+1]_i \ failed-by[r+1]_i, where the values of the last two terms are taken from the
pre-state of net_mcast(i,r+2,v,W), and
W' = world[1,j]_j \ out-by[r+1]_j \ failed-by[r+1]_j, where the values of the last two terms
are taken from the pre-state of net_mcast(j,r+2,v',W').

Consider some k ∈ W. The precondition of net_mcast(i,r+2,v,W) implies that in the pre-state,
value[r+1,k]_i ≠ ⊥, that is, i has received a round r+1 message from k. This means
that k has previously sent a round r+1 message. Since (by assumption) r > 0, it follows that
r+1 ≥ 2, which means that k has sent a round 2 message. Invariant 5.1, applied to any state after
both net_mcast(j,1,*,*) and net_mcast(k,2,*,*) have occurred, implies that k is in the world
component of j's round 1 message, and so k is put into world[1,j]_j when that is defined. To
prove that k ∈ W', it suffices to show that k is never placed into either of the sets out-by[r+1]_j or
failed-by[r+1]_j.

First, we show that k is never placed into out-by[r+1]_j. Suppose for the sake of contradiction
that k is put into out-by[r+1]_j at some point in the execution. Then consider some state that
occurs after this has happened, and that is not before the pre-state of net_mcast(i,r+2,v,W). In
this state, we have both value[r+1,k]_i ≠ ⊥ and k ∈ out-by[r+1]_j. This contradicts Invariant A.4.
Therefore, k is never placed into out-by[r+1]_j.

Second, we show that k is never placed into failed-by[r+1]_j. Suppose for the sake of
contradiction that k is put into failed-by[r+1]_j at some point in the execution. Then k fails in the
execution, which implies that it fails before point t. But we have already noted that k sends a
round r+1 message during the execution. It does not send this before point t, because that would
mean that it would reach round r+1 before point t, contradicting our assumptions. So k sends the
round r+1 message after point t, and so it cannot fail before point t, a contradiction. Therefore,
k is never placed into failed-by[r+1]_j. ■

The next lemma says that, under the same assumptions as for the previous lemma, all the round 
r+2 messages have the same value component. 

Lemma 5.3 Suppose that r > 0. Suppose that there is a point t in an execution such that every
process has round ≤ r at point t, and no fail events happen from t onward.
If net_mcast(i,r+2,v,W) and net_mcast(j,r+2,v',W') both occur in the execution, then v = v'.

Proof: Process i determines v as the minimum of all values value[r+1,k]_i for all k ∈ W, and
process j determines v' as the minimum of all values value[r+1,k]_j for all k ∈ W'. Lemma 5.2
implies that W = W'. Since values are consistent (by Invariant A.2), the sets of values over which
the two minima are taken are identical. Therefore, v = v'. ■

Finally, we prove the main early-deciding theorem. It says that, if no failures happen from
some point onward and the rounds of all processes are at most r when failures cease, then no CUP
participant ever advances beyond round r+2. Since we have already proved termination, this
implies that all active CUP participants decide by round r+2.

Theorem 5.4 Suppose that r > 0. Suppose that there is a point t in the execution such that every
process has round ≤ r at point t, and no fail events happen from t onward.
Then every process always has round ≤ r+2.

Proof: Lemmas 5.2 and 5.3 yield a common value and world for round r+2 messages. Fix v' and
W' to be the common value and world, respectively.

We show that the precondition of net_mcast(i,r+3,*,*) can never be true, which implies
that such an event can never happen. This implies that every process always has round ≤ r+2.
Suppose for the sake of contradiction that the precondition of net_mcast(i,r+3,v,W) is true in
some reachable state s, for some fixed i.

Since the precondition holds in s, world[r+2,i]_i ≠ ⊥ in s, and so Invariant A.1 implies
that some (i,r+2,v'',W'') message is in the Net in s, where v'' = value[r+2,i]_i and W'' =
world[r+2,i]_i. Since v' and W' are the common value and world for round r+2 messages, this
implies that value[r+2,i]_i = v' and world[r+2,i]_i = W'.

We show that for all j ∈ world[r+2,i]_i \ out[r+2]_i, value[r+2,j]_i = value[r+2,i]_i and
world[r+2,j]_i ⊆ world[r+2,i]_i. This suffices to show that the final precondition fails, which yields
a contradiction.

Fix j ∈ world[r+2,i]_i \ out[r+2]_i. Since failed[r+2]_i = {}, it follows that j ∈ world[r+2,i]_i
\ out[r+2]_i \ failed[r+2]_i. The precondition of the net_mcast then implies that value[r+2,j]_i ≠ ⊥
in state s. Invariant A.1 then implies that some (j,r+2,v''',W''') message is in the Net in s,
where v''' = value[r+2,j]_i and W''' = world[r+2,j]_i. Since v' and W' are the only value and
world for round r+2 messages, this implies that value[r+2,j]_i = v' and world[r+2,j]_i = W' in
state s. Thus, value[r+2,j]_i = value[r+2,i]_i and world[r+2,j]_i ⊆ world[r+2,i]_i, as needed. ■



Note that this proof does not work for the case where r=0, because of potential differences in 
the initial worlds of correct processes. Consider, for example, an execution in which no process 
ever fails, and some process, k, leaves after sending a round 1 message. Process k may be included 
in the initial world of process i but not in the initial world of another process j, if j initiates CUP 
after k leaves. In this case, i takes k's round 1 message into account when choosing its round 2
message, while j does not (because k is not in j's initial world). This scenario can only occur in 
round 1, because no process can send a round 2 message before j initiates. 

For the case where r = 0, the best we can state is: 

Corollary 5.5 Suppose there is a point t in the execution such that every process has round = 0
at point t, and no fail events happen from t onward.
Then every process always has round ≤ 3.

Proof: This is immediate from Theorem 5.4, using r = 1. ■

5.4 Timing Assumptions 

For the sake of analyzing the performance of the CUP algorithm, we use timed I/O automata [21, 
Ch. 23]. We can regard an ordinary I/O automaton as a special case of the timed model, in which 
arbitrary amounts of time can pass between events. All the safety results carry over to this model. 
For this analysis, we add an extra assumption: we assume that any action that is enabled either 
gets performed or gets disabled by another action, before any time passes. 

5.5 Latency Analysis 

We now analyze the algorithm's latency in executions in which there are time bounds on certain 
environment actions. We assume the following bounds: 

1. δ_1 is an upper bound on message latency. That is, if a net_rcv(m) event occurs, the time
since the corresponding net_mcast(m) is at most δ_1.

2. δ_2 is an upper bound on failure and leave detection time. Moreover, if a message is lost due to
failure, then the failure is detected at most δ_2 after the lost message was sent. More precisely,

(a) Assume init_i(*,W) occurs with j ∈ W and fail_j or leave_j occurs at time t. Then
fail_detect_i(j), leave_detect_i(j), decide_i, leave_i, or fail_i occurs by time t + δ_2.

(b) Define U = ∪_{k∈I} {W | init_k(*,W) occurs}. Assume i,j ∈ U and net_mcast_j(m)
occurs at time t but no net_rcv_i(m) occurs. Then fail_detect_i(j), leave_detect_i(j),
decide_i, leave_i, or fail_i occurs by time t + δ_2.

3. δ_3 is an upper bound on the time difference between the initiation time of different processes.
More precisely:

Assume some process initiates at time t and does not fail by time t + δ_1. Assume further that
init_i(*,W) occurs. Then, every process j ∈ W initiates, abstains, leaves, or fails by time
t + δ_3.




In practice, the failure detection time would be at least as large as the message latency. We
therefore assume that δ_2 ≥ δ_1.

We now use the above bounds on the environment to establish bounds on CUP's running times. 
The next lemma bounds the time it takes from when some process initiates CUP until all processes 
terminate round 1. 

Lemma 5.6 Assume that some process initiates CUP at time t and does not fail by time t + δ_1.
Then by time t + δ_2 + δ_3, every process that initiates either terminates round 1, or leaves, or fails.

Proof: Let i be a process that initiates and does not leave or fail by time t + δ_2 + δ_3. We now
show that i terminates round 1 by time t + δ_2 + δ_3. If i decides by time t + δ_2 + δ_3, then we are
done. We therefore assume that i does not decide by this time.

In order to terminate round 1, i has to have a round 1 message from every process j ∈
world[1,i]_i \ out[1]_i \ failed[1]_i. That is, for every process j ∈ world[1,i]_i, i has to
receive a round 1 message or an OUT message from j, or a fail_detect_i(j) or a leave_detect_i(j)
event has to occur.

Fix a process j ∈ world[1,i]_i, i.e., j is in i's initial world. Since some process initiates at time
t, by our assumption on initiation times, j initiates, abstains, leaves, or fails by time t + δ_3.

If j fails or leaves by time t + δ_3, then by our assumption on failure and leave detection times,
fail_detect_i(j) or leave_detect_i(j) occurs by time t + δ_2 + δ_3 (since we assume that i does not
decide, leave, or fail by this time), and we are done.

Assume now that j does not fail or leave by time t + δ_3. Since j is in i's initial world, j either
initiates or abstains by this time, at which point j sends a round 1 message or an OUT message
(resp.). If i receives this message, i receives it by time t + δ_3 + δ_1. If i does not receive this message,
fail_detect_i(j) or leave_detect_i(j) occurs by time t + δ_3 + δ_2.

Since δ_2 ≥ δ_1, we get that for every j ∈ world[1,i]_i, by time t + δ_3 + δ_2, i either receives a
round 1 message or an OUT message from j or a fail_detect_i(j) or leave_detect_i(j) event
occurs. ■

The following lemma bounds the duration of subsequent rounds. 

Lemma 5.7 Assume that by time t, every process that initiates CUP either terminates round r
> 0, or decides, or leaves, or fails. Then, by time t + δ_2, every process that initiates CUP either
terminates round r+1, or decides, or leaves, or fails.

Proof: Consider a process i that initiates CUP and does not leave or fail or decide by time t + δ_2.
We now show that i terminates round r+1 by time t + δ_2.

In order to terminate round r+1, i has to have a round r+1 message from every process j ∈
world[r+1,i]_i \ out[r+1]_i \ failed[r+1]_i. That is, for every process j ∈ world[r+1,i]_i, i
has to either receive a round r+1 message or an OUT message from j, or a fail_detect_i(j) or a
leave_detect_i(j) event has to occur.

Fix a process j ∈ world[r+1,i]_i. Process j must have initiated. By time t, j terminates
round r, or decides, or leaves, or fails. If j leaves or fails by time t, then fail_detect_i(j) or
leave_detect_i(j) occurs by time t + δ_2. Otherwise, j sends a round r+1 message or an OUT
message (in case it decides) by time t. If i receives this message, i receives it by time t + δ_1.
Otherwise, fail_detect_i(j) or leave_detect_i(j) occurs by time t + δ_2. Since δ_2 ≥ δ_1, we get that
i terminates round r+1 by time t + δ_2. ■
Using the two lemmas above, we get the following bound on the running time of an execution
of CUP with r rounds.

Lemma 5.8 Assume that some process initiates CUP at time t and does not fail by time t + δ_1.
If a process i decides at round r > 0, it does so by time t + δ_3 + rδ_2.

Proof: By Lemma 5.6, by time t + δ_3 + δ_2, every process that initiates CUP either terminates
round 1, or leaves, or fails. By iterative application of Lemma 5.7, we get that by time t + δ_3 + δ_2 +
(r − 1)δ_2 = t + δ_3 + rδ_2, every process that initiates CUP either terminates round r, or decides, or
leaves, or fails. ■
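
The iteration in this proof can be checked numerically with a small Python sketch. The function name `round_deadline` and its parameters are illustrative assumptions; `delta2` and `delta3` stand for the detection bound and the initiation-spread bound defined above.

```python
def round_deadline(t: float, delta2: float, delta3: float, r: int) -> float:
    """Time by which every initiator has terminated round r (r >= 1)."""
    if r < 1:
        raise ValueError("r must be at least 1")
    # Lemma 5.6: round 1 terminates by t + delta3 + delta2.
    deadline = t + delta3 + delta2
    # Lemma 5.7: each subsequent round adds at most delta2.
    for _ in range(r - 1):
        deadline += delta2
    return deadline
```

The loop reproduces the closed form t + delta3 + r * delta2 used in the lemma.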

As a consequence of the above lemmas and the early-deciding theorem of the previous section 
we get the following theorem: 

Theorem 5.9 Suppose that there is a point t in the execution such that no fail events happen
from t onward. Suppose also that some process initiates CUP by time t. Then every process that
decides, decides by time t + δ_3 + 3δ_2.

Proof: Let r be the highest value of round of any process at time t. Since some process initiated
CUP by time t, r > 0. By Theorem 5.4, every process that decides, decides at the end of round
r+2 at the latest.

We consider two cases. First, if r > 1, then by Invariant A.12, every process that initiated
CUP has either terminated round r-1 or left or failed by time t. By applying Lemma 5.7 three
times, we get that every process that initiates CUP either terminates round r+2 or leaves or fails
by time t + 3δ_2. Therefore, in this case, every process that decides, decides by time t + 3δ_2.

Next, assume that r = 1. Since some process initiates CUP by time t and does not fail, by
Lemma 5.6, by time t + δ_3 + δ_2, every process that initiates CUP either terminates round 1, or
leaves, or fails. By applying Lemma 5.7 twice, we get that every process that initiates CUP either
terminates round r+2 or leaves or fails by time t + δ_3 + 3δ_2. Therefore, in this case, every process
that decides, decides by time t + δ_3 + 3δ_2. ■

6 Environment and Model Assumptions for Atom 

6.1 Timing Assumptions 

We model time using a continuous global variable now, which holds the real time. This is a real
variable, initially 0. We assume that it increases with derivative 1. Each endpoint i is equipped
with a local clock, clock_i, modeled by a continuous, bijective, monotonically increasing function
from the nonnegative reals to the nonnegative reals.

We assume a bound of Γ on clock skew, where Γ is a positive real number. Specifically, for
each endpoint i, we assume that in any reachable state of the system, |clock_i − now| ≤ Γ/2.
That is, the difference between each local clock and the real time is at most Γ/2. It follows that
the clock skew between any pair of processes is at most Γ; formally, in any reachable state, and for
any two endpoints i and j, |clock_i − clock_j| ≤ Γ.
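
The pairwise bound follows from the per-clock bound by the triangle inequality. Writing \Gamma for the clock-skew bound, the step can be spelled out as:

```latex
\[
|clock_i - clock_j|
  \;\le\; |clock_i - now| + |now - clock_j|
  \;\le\; \frac{\Gamma}{2} + \frac{\Gamma}{2}
  \;=\; \Gamma .
\]
```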

We assume that local processing time is 0 and that actions are scheduled immediately when
they are enabled. Formally, when any locally controlled action of any process that is part of our
local algorithm is enabled, then before any time passes, the action is either performed or becomes
disabled.
6.2 Reliable Network Assumptions

We assume that we are given a low-level reliable network service Net. Like DAB, Net is
parameterized by a message alphabet, M.

The Net(M) signature is defined in Figure 6. The actions are the same as those of DAB, except
that they are prefixed with net_.

Input:
  net_join_i, net_leave_i, fail_i, i ∈ I
  net_mcast_i(m), m ∈ M, i ∈ I

Output:
  net_join_OK_i, net_leave_OK_i, i ∈ I
  net_rcv_i(m), m ∈ M, i ∈ I

Figure 6: The signature of the Net service.

Net(M) assumes that its application satisfies the same basic safety conditions as those specified 
above for DAB(M), except that action names are preceded with net_. Assuming the application 
satisfies these conditions, Net(M) satisfies a number of safety and liveness properties. 

First, Net satisfies the basic properties specified above for DAB: join/leave integrity, message 
integrity, eventual join, and eventual leave. All of these properties are the same as for DAB, except 
that action names are prefixed with net_. 

In addition, Net guarantees FIFO delivery of messages: 

• FIFO delivery: If net_mcast_i(m) occurs before net_mcast_i(m'), and net_rcv_j(m') occurs,
then net_rcv_j(m) occurs before net_rcv_j(m').

Net(M) also satisfies the following liveness property:

• Eventual delivery: Suppose net_mcast_i(m) occurs after net_join_OK_j, and no fail_i occurs.
Then either net_leave_j or fail_j or net_rcv_j(m) occurs.

Additionally, the network latency is bounded by a constant nonnegative real number Δ.
Formally, Net(M) guarantees:

• Message latency: If net_rcv_j(m) occurs, then the real time elapsed since the corresponding
net_mcast_i(m) is at most Δ.

The maximum message latency of Δ guaranteed by Net is intended to include any pre-send
delay at the network module of the sending process.

Since an implementation of Net cannot predict the future, it must deliver messages within time
Δ as long as no failures occur. In particular, if a message is sent more than Δ time before its sender
fails, it must be delivered.

7 The Atom Algorithm 

The Atom algorithm consists of a collection of processes corresponding to the different endpoints 
in I. It uses Net and CUP services as building blocks. It uses multiple instances of CUP. 
7.1 Data Types

Atom defines the constant Θ, a positive real number. This will represent a time slot. We assume
that Θ > Δ.

Recall that M represents the message alphabet of DAB. We will use M' to represent the message
alphabet of Net. We define the message alphabet of Net in terms of the alphabet of Atom:

• M_1, the set of finite sequences of elements of M. These are the bulk messages processes send.

• M_2 = M_1 ∪ {JOIN, LEAVE} ∪ ({CUP-INIT} × I)

• M' = I × M_2 × N.

M' is the complete message alphabet of Net. Each message contains either a bulk message (sequence
of client messages) for a particular slot, a request to join or leave a particular slot, or a report that
a process has initiated consensus on behalf of a particular endpoint. Each message is tagged with
the sender and the slot.
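
One possible encoding of this alphabet is sketched below in Python. All class and field names (`Join`, `Leave`, `CupInit`, `NetMessage`, `endpoint`, and so on) are hypothetical choices made for illustration; endpoints are modeled as strings and client messages as strings.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Join:
    """The JOIN request."""


@dataclass(frozen=True)
class Leave:
    """The LEAVE request."""


@dataclass(frozen=True)
class CupInit:
    """A CUP-INIT report; endpoint is the process consensus is run for."""
    endpoint: str


# A bulk message: a finite sequence of client messages (an element of M_1).
Bulk = Tuple[str, ...]
# An element of M_2: bulk message, JOIN, LEAVE, or CUP-INIT report.
Payload = Union[Bulk, Join, Leave, CupInit]


@dataclass(frozen=True)
class NetMessage:
    """An element of M': a payload tagged with its sender and slot."""
    sender: str
    payload: Payload
    slot: int
```

For example, `NetMessage("p1", ("a", "b"), 3)` would stand for p1's bulk message for slot 3.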

7.2 Using the Net and CUP 

The Net service alphabet is instantiated with M'. That is, Atom uses a service Net(M') to
implement the service DAB(M). Atom uses multiple instances of CUP, at most one for each process j.

As before, a fail_i action causes process i to stop. fail_i actions go to all the components, i.e.,
Net and all instances of CUP (including dormant ones), and cause all of them to stop taking any
locally controlled actions. Since fail_i actions cannot be intercepted, we do not include them in
the code.

leave_i actions also go directly to all the local instances of CUP, including dormant ones.

7.3 Atom Algorithm Overview 

The algorithm divides time, and correspondingly messages, into slots. As time advances, each process
advances through slots. The duration of a slot is Θ.

Each process multicasts all of its messages for a given slot in one bulk message. This is a
useful abstraction that we make in order to simplify the presentation and analysis of the Atom
algorithm. In practice, the bulk message does not have to be sent as one message; a standard
packet assembly/disassembly layer can be used to provide all-or-nothing behavior.

Message delivery is also done in order of slots. Before delivering messages of a certain slot s, 
each process has to determine the membership of this slot, that is, the set of processes from which 
to deliver messages in this slot. To ensure total order, all the processes that deliver messages for a 
certain slot have to agree upon the membership of each slot. For each slot, messages are delivered in 
the order of process indices, and for each process, the messages are unpacked from its bulk message 
and delivered in FIFO order. 
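
The per-slot delivery order can be sketched as follows. The function `delivery_order` and the use of integer process indices are assumptions made for this illustration; `in_buf` maps each process index to the bulk message received from it for the slot.

```python
from typing import Dict, Iterable, List, Sequence


def delivery_order(members: Iterable[int], in_buf: Dict[int, Sequence[str]]) -> List[str]:
    """Return one slot's messages in the order they would be delivered."""
    delivered: List[str] = []
    # Processes are visited in order of their indices ...
    for j in sorted(members):
        # ... and each bulk message is unpacked in FIFO order.
        delivered.extend(in_buf.get(j, ()))
    return delivered
```

Because every process that delivers for the slot uses the same agreed membership, each of them computes the same sequence.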

7.4 Signature 

The signature of Atom at process i, Atom_i, is presented in Figure 7. It includes all the interaction
with the client and all the interaction with the underlying network. The implementation of Atom
uses CUP as a building block. Hence Atom_i has additional input and output actions for interacting
with CUP. Since Atom uses multiple instances of CUP, at most one for each process j, actions of
CUP automata are prefixed with CUP(j). For example, process i uses the action CUP(j).init_i to
initiate the CUP automaton associated with process j. CUP.fail and CUP.leave are not output
actions of Atom, since they are routed directly from the environment to all instances of CUP.

The signature of Atom_i also includes two internal actions, end_slot_i and members_i. These two
actions play a role in determining the membership for each slot. end_slot_i(s) occurs at a time by
which slot s messages from all processes should have reached process i. At this point, processes
from which messages are expected but do not arrive are suspected to have failed. For each suspected
process j, CUP(j) is run to have the surviving processes agree upon j's failure slot. This is needed
because failed processes can be suspected at different slots by different surviving processes. After
CUP reaches decisions about all the suspected processes that could have failed at slot s,
members_i(P,s) can occur, with P being the agreed membership for slot s. When process i performs
members_i(P,s), all the messages included in bulk messages that i received for slot s from processes
in P are delivered (their delivery is triggered) in order of process indices.

Input:
  join_i, leave_i
  net_join_OK_i, net_leave_OK_i
  mcast_i(m), m ∈ M
  net_rcv_i(m), m ∈ M'
  fail_i
  CUP(j).decide_i(v), v ∈ N

Output:
  join_OK_i, leave_OK_i
  net_join_i, net_leave_i
  net_mcast_i(i, m, s), m ∈ M_2, s ∈ N
  rcv_i(m), m ∈ M
  CUP(j).init_i(v, W), v ∈ N
  CUP(j).abstain_i
  CUP(j).leave_detect_i(j), j ∈ I
  CUP(j).fail_detect_i(j), j ∈ I

Internal:
  end_slot_i(s), s ∈ N
  members_i(P, s), P set of I, s ∈ N

Figure 7: Atom_i: Signature.



7.5 Pseudo-code 

The Atom_i code is presented in Figures 8-10. The state components are presented in Figure 8.

Recall that we do not assume that processes execute the algorithm from the beginning of
time. Rather, the application issues an explicit join event, and waits for a join_OK. The variable
join-slot holds the slot at which a process starts participating in the algorithm; this will be the
value of current-slot when join_OK is issued, and the first slot for which a bulk message
will be sent. If a process explicitly leaves the algorithm, its leave-slot holds the slot immediately
following the last slot in which the process sends a bulk message. Both join-slot and leave-slot
are initially ∞, so as to be larger than any actual slot number they are compared with.
clock ∈ R, initially ∈ [0, Γ/2]; dynamic type: continuous functions

join-slot ∈ N ∪ {∞}, initially ∞
leave-slot ∈ N ∪ {∞}, initially ∞
did-join-OK, boolean, initially false
did-leave, boolean, initially false

mcast-slots ⊆ N, initially {}
ended-slots ⊆ N, initially {}
reported-slots ⊆ N, initially {}

for every s ∈ N
  out-buf[s] ∈ M_2, initially empty sequence of M
  joiners[s] ⊆ I, initially {}
  leavers[s] ⊆ I, initially {}
  suspects[s] ⊆ I, initially {}

for every s ∈ N, j ∈ I
  in-buf[j,s], j ∈ I, s ∈ N, finite sequence of M or ⊥, initially ⊥

for every j ∈ I \ {i}
  CUP-status[j] ∈ {idle, req, running, done}, initially idle
  CUP-req-val[j] ∈ N ∪ {⊥}, initially ⊥
  CUP-dec-val[j] ∈ N ∪ {⊥}, initially ⊥

derived variables:
  current-slot ∈ N = ⌊clock / Θ⌋
  for every s ∈ N
    alive[s] ⊆ I = { j | in-buf[j,s] ≠ ⊥ }

Figure 8: Atom_i: State.

The boolean flags did-join-OK and did-leave are used to ensure that join_OK and net_leave
actions will not be performed more than once. The set mcast-slots keeps track of the slots for
which the process already multicast a message (JOIN, LEAVE, or bulk). Likewise, ended-slots
and reported-slots keep track of the slots for which the process already performed the end_slot
or members actions, resp.

out-buf[s] stores the message (bulk, JOIN, or LEAVE) that is multicast for slot s; it initially
holds an empty sequence, and in an active slot, all application messages are appended into it.
A JOIN message is inserted for the slot before the join-slot, and a LEAVE message for the
leave-slot. Either way, there is no overlap with a bulk message.

The variables joiners[s] and leavers[s] keep track of the processes j for which join-slot_j
= s (resp. leave-slot_j = s). suspects[s] is the set of processes suspected in slot s as determined
when end_slot(s) occurs.

The variable in-buf[j,s] is a finite sequence of messages received in a slot s bulk message
from process j. The data type finite sequence supports assignment, extraction of the head of the
queue, and testing for emptiness.

There are three variables for tracking the status and values of the different instances of CUP.
CUP-status[j] is initially idle; when CUP(j) is initiated, it becomes running; if a CUP-INIT
message for j arrives, it becomes req; and when there is a decision for CUP(j), or if the process
abstains from CUP(j), it becomes done. CUP-req-val[j] holds the lowest slot value associated
with a CUP-INIT message for j (⊥ if no such message has arrived). Finally, CUP-dec-val[j] holds
the decision reached by CUP(j), and ⊥ if there is none.

alive[s] is a derived variable, storing the set of processes from which slot s bulk messages
were received.

join_i
  Eff: trigger(net_join_i)

net_join_OK_i
  Eff: join-slot ← current-slot + 2 + ⌈Γ/Θ⌉
       out-buf[join-slot - 1] ← JOIN

join_OK_i
  Pre: did-join-OK = false
       current-slot = join-slot
  Eff: did-join-OK ← true

leave_i
  Eff: if (join-slot ∈ N) then
         leave-slot ← max(current-slot, join-slot) + 1
         out-buf[leave-slot] ← LEAVE

net_leave_i
  Pre: did-leave = false
       leave-slot ∈ mcast-slots
  Eff: did-leave ← true

net_leave_OK_i
  Eff: trigger(leave_OK_i)

mcast_i(m)
  Eff: if (join-slot ≤ current-slot < leave-slot) then
         append m to out-buf[current-slot]

net_mcast_i(i, m, s)
  Pre: join-slot ∈ N
       join-slot - 1 ≤ s ≤ leave-slot
       current-slot = s+1
       s ∉ mcast-slots
       m = out-buf[s]
  Eff: mcast-slots ← mcast-slots ∪ { s }

net_rcv_i(j, JOIN, s)
  Eff: joiners[s+1] ← joiners[s+1] ∪ { j }

net_rcv_i(j, LEAVE, s)
  Eff: leavers[s] ← leavers[s] ∪ { j }
       foreach (k such that CUP-status[k] = running) do
         trigger(CUP(k).leave_detect_i(j))

net_rcv_i(j, m, s), m sequence of M
  Eff: in-buf[j,s] ← m

Figure 9: Atom_i: Transitions related to multicast, join, and leave.

In Figure 9 we present the first part of Atom's transitions, including transitions related to 
joining, leaving, multicasting messages, and receiving messages from the network. Transitions 
related to membership and totally ordered delivery are presented in Figure 10.

When the application issues a join, Atom triggers net_join. Once the Net responds with a
net_join_OK, Atom calculates the join-slot to be 2 + ⌈Γ/Θ⌉ slots in the future. This allows
enough time for the JOIN message to reach the other processes. A JOIN message is then inserted
into out-buf[join-slot - 1]. Once current-slot reaches join-slot, join_OK is issued to the
application.
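
As a rough illustration of the join-slot computation, the following Python sketch assumes that the
bracketed term in the join-slot formula is a ceiling, ⌈Γ/Θ⌉, and that slots are numbered by
integers. The names join_slot, join_announcement_slot, gamma, and theta are hypothetical and are not
part of Atom's specification.

```python
import math


def join_slot(current_slot: int, gamma: float, theta: float) -> int:
    """Slot in which a joining process becomes a member.

    Assumption: the slot offset is 2 + ceil(gamma / theta), where gamma is the
    clock skew bound and theta is the slot length.
    """
    if theta <= 0:
        raise ValueError("slot length theta must be positive")
    if gamma < 0:
        raise ValueError("clock skew gamma must be non-negative")
    return current_slot + 2 + math.ceil(gamma / theta)


def join_announcement_slot(current_slot: int, gamma: float, theta: float) -> int:
    """Slot whose outgoing buffer carries the JOIN message (join-slot - 1)."""
    return join_slot(current_slot, gamma, theta) - 1
```

With a skew smaller than one slot, for example gamma = 0.5 and theta = 2.0, a process that receives
net_join_OK in slot 10 would announce itself in slot 12 and become a member in slot 13.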

When the application issues a leave, the leave-slot is chosen to be the ensuing slot, and a
LEAVE message is inserted into out-buf[leave-slot]. A net_leave is issued after the LEAVE
message has been multicast, and the net_leave_OK triggers a leave_OK to the application.

Messages multicast by the application are appended to the bulk message for the current slot
in out-buf[current-slot]. Once a slot s ends, the message pertaining to this slot is multicast
to the other processes using net_mcast. If s = join-slot - 1, a JOIN message is sent. If s =
leave-slot, a LEAVE message is sent, and if s is between join-slot and leave-slot - 1, a
bulk message is sent.

When a bulk message is received, it is stored in the appropriate in-buf. When a JOIN (LEAVE)
message is received, the sender is added to the joiners (resp. leavers) set for the appropriate slot.
Additionally, when a LEAVE message is received, CUP.leave_detect is triggered for all running
instances of CUP.

Process i performs end_slot_i(s) once it should have received all the slot s messages sent by
other non-failed processes. Since slot s messages are sent immediately when slot s ends, messages
are delayed at most Δ time in Net, and the clock difference is at most Γ, process i should have
all the non-failed processes' slot s messages Δ + Γ time after slot s+1 began. At this time, clock
≥ (s+1)Θ + Δ + Γ. Process i expects to receive slot s bulk messages from all the processes that
are in alive[s-1], except for those that are leaving in slot s. Any process from which a slot s
bulk message is expected but does not arrive becomes suspected at this point, and is included in
suspects[s].
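
The suspicion rule can be sketched in Python as follows. This is a minimal sketch under stated
assumptions: process identifiers are integers, the relevant sets are passed in explicitly, and
joiners of slot s count as expected senders, mirroring the suspects[s] assignment in Figure 10
with the grouping ((alive[s-1] ∪ joiners[s]) \ leavers[s]) \ alive[s]. The names end_slot_deadline
and suspects_for_slot are hypothetical.

```python
def end_slot_deadline(s: int, theta: float, delta: float, gamma: float) -> float:
    """Earliest local clock value at which slot s may be ended.

    Slot s+1 begins at clock (s+1)*theta; slot s messages may take up to delta
    in the network, and clocks may differ by up to gamma.
    """
    return (s + 1) * theta + delta + gamma


def suspects_for_slot(alive_prev: set, joiners: set, leavers: set, alive_now: set) -> set:
    """Processes expected to send a slot s bulk message that did not.

    alive_prev: senders of slot s-1 bulk messages
    joiners:    processes joining in slot s
    leavers:    processes leaving in slot s
    alive_now:  senders of slot s bulk messages
    """
    expected = (alive_prev | joiners) - leavers
    return expected - alive_now
```

For example, if processes 1, 2, and 3 were alive in slot s-1, process 4 joins in slot s, process 2
leaves in slot s, and only 1 and 4 send slot s bulk messages, then process 3 is the only suspect.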

For every suspected process, CUP is run in order to agree upon the slot at which the process
failed. The slot s in which the process is suspected is used as the initial value for CUP. The estimated
world for CUP is alive[s] ∪ joiners[s+1]. This way, if k joins in slot s+1, k is included in
the estimated world. This is needed in order to satisfy the world consistency assumption of CUP,
because k can detect the same failure at slot s+1, and therefore participate in CUP(j). When i
initiates CUP(j), it also multicasts a (CUP-INIT, j) message. If a process k does not detect the
failure and does not participate, the (CUP-INIT, j) message forces k to abstain.

Since Atom implements the failure detector for CUP, the effect of end_slot_i(s) also triggers
CUP(k).fail_detect_i(j) actions for every suspected process j, and for every currently running
instance k of CUP.

Process i abstains from CUP(j) only if a (CUP-INIT, j) message has previously arrived, setting
CUP-status[j]_i = req, and only if end_slot_i has already occurred for a slot value greater than
CUP-req-val[j]_i. The latter condition ensures that i abstains only from instances of CUP that it
will not initiate. This is because the network guarantees that when a process fails, at most one slot
bulk message from this process is lost (since we assume that Δ < Θ). This implies that the
detection of j's failure by two non-failed processes can occur at most one slot apart. Therefore, if
end_slot_i has already occurred for a slot value greater than CUP-req-val[j]_i, i will never suspect
j.
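
The request and abstain bookkeeping can be illustrated with a small Python sketch. It mirrors the
guards of net_rcv(j, (CUP-INIT, k), s) and CUP(j).abstain in Figure 10 as closely as they can be
read; the class name CupBookkeeping, its field names, and the use of None for ⊥ are assumptions
made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class CupBookkeeping:
    """Per-process bookkeeping for CUP instances, keyed by suspected process id."""

    status: dict = field(default_factory=dict)    # id -> "idle" | "req" | "running" | "done"
    req_val: dict = field(default_factory=dict)   # id -> lowest slot seen in a CUP-INIT
    ended_slots: set = field(default_factory=set)

    def on_cup_init(self, k: int, s: int) -> None:
        """Handle a (CUP-INIT, k) message tagged with slot s."""
        lowest = self.req_val.get(k)  # None plays the role of the undefined value
        if self.status.get(k, "idle") == "idle" or (lowest is not None and lowest > s):
            self.status[k] = "req"
            self.req_val[k] = s

    def can_abstain(self, k: int) -> bool:
        """Abstain is enabled once some ended slot exceeds the requested slot."""
        if self.status.get(k, "idle") != "req":
            return False
        return any(s > self.req_val[k] for s in self.ended_slots)

    def abstain(self, k: int) -> None:
        if not self.can_abstain(k):
            raise RuntimeError("abstain precondition does not hold")
        self.status[k] = "done"
```

In this sketch, a process that has ended slot 5 and then learns of a CUP-INIT for slot 4 may abstain
at once, whereas a CUP-INIT for slot 5 leaves it waiting until slot 6 has ended.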

end_slot_i(s)
  Pre: join-slot ≤ s
       leave-slot = ∞
       s ∉ ended-slots
       clock ≥ (s+1)Θ + Δ + Γ
  Eff: ended-slots ← ended-slots ∪ {s}
       suspects[s] ← (alive[s-1] ∪ joiners[s] \ leavers[s]) \ alive[s]
       foreach (j ∈ suspects[s]) do
         trigger(CUP(j).init_i(s, alive[s] ∪ joiners[s+1]))
         net_mcast_i(i, (CUP-INIT, j), s)
         CUP-status[j] ← running
         foreach (k such that CUP-status[k] = running) do
           trigger(CUP(k).fail_detect_i(j))

net_rcv_i(j, (CUP-INIT, k), s)
  Eff: if (CUP-status[k] = idle ∨ CUP-req-val[k] > s) then
         CUP-status[k] ← req
         CUP-req-val[k] ← s

CUP(j).abstain_i
  Pre: CUP-status[j] = req
       ∃s ∈ ended-slots : s > CUP-req-val[j]
  Eff: CUP-status[j] ← done

CUP(j).decide_i(s)
  Eff: CUP-status[j] ← done
       CUP-dec-val[j] ← s

members_i(P, s)
  Pre: s = min{ended-slots \ reported-slots}
       s + 1 ∈ ended-slots
       ∀j ∈ (suspects[s] ∪ suspects[s+1]) : CUP-status[j] = done
       P = { j ∈ alive[s] | CUP-dec-val[j] = ⊥ ∨ CUP-dec-val[j] > s }
  Eff: reported-slots ← reported-slots ∪ {s}
       foreach j ∈ P, in order of indices do
         while in-buf[j, s] not empty do
           trigger(rcv_i(head(in-buf[j, s])))

Figure 10: Atom_i: Transitions related to membership and message delivery.

The members_i(P, s) action triggers the delivery of all slot s messages from processes in P. It can
only occur once agreement has been reached about the processes to be included in P. Since the slot
at which a process k is suspected by two processes i and j can differ by at most one, members_i(P,
s) can occur after i receives decisions from all instances of CUP pertaining to processes suspected in
slots up to s+1. Therefore, members_i(P, s) must occur after end_slot_i(s+1), when the suspicions
for slot s+1 are determined. The set P includes every process j that is alive in slot s and for which
there is either no CUP instance running (in which case j was not suspected), or the CUP decision
value is greater than s.
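
The membership rule and the delivery loop of members(P, s) can be sketched in Python as below. This
is an illustrative sketch only: it assumes integer process identifiers, represents a missing CUP
decision by an absent dictionary key, and uses the hypothetical names membership and deliver_slot.

```python
def membership(alive_s: set, cup_dec_val: dict, s: int) -> list:
    """Processes whose slot s messages are delivered, in order of index.

    A process stays in P if no CUP decision exists for it (it was never
    suspected) or if the agreed failure slot is greater than s.
    """
    return sorted(
        j for j in alive_s
        if cup_dec_val.get(j) is None or cup_dec_val[j] > s
    )


def deliver_slot(members: list, in_buf: dict, s: int) -> list:
    """Drain the slot s buffers of the members in index order.

    in_buf maps (sender, slot) to a list of messages; the lists are emptied
    as messages are delivered.
    """
    delivered = []
    for j in members:
        queue = in_buf.get((j, s), [])
        while queue:
            delivered.append(queue.pop(0))
    return delivered
```

Because every process computes P from the same agreed CUP decisions and iterates in the same index
order, this kind of loop yields the same delivery order everywhere.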

7.6 Latency Analysis

In this section we analyze the latency guarantees of Atom. In Section 7.6.1 we show that in
failure-free executions, Atom's message latency is bounded by Δ + 2Θ + 2Γ. We denote this bound by
Δ_Atom. In Section 7.6.2, we assign values to the constants that were used in the analysis of CUP
in Section 5.5 (δ₁, δ₂, and δ₃). Then, in Section 7.6.3, we consider executions in which failures do
occur but there is a long time period with no failures. We analyze the time it takes Atom to clear
the backlog it has due to past failures, and reach a situation in which message latency is bounded
by the same bound as in failure-free executions, namely Δ_Atom, barring additional failures.

The fact that once failures stop for a bounded time all messages are delivered within constant
time implies that in periods with f failures, Atom's latency is at most linear in the number of failing
processes.

7.6.1 Failure free executions 

Lemma 7.1 The time from when process j starts slot s (i.e., current-slot_j becomes s) until
process i performs end_slot_i(s+1) is at most Δ + 2Θ + 2Γ.

Proof: According to its preconditions, end_slot_i(s+1) occurs 2Θ + Δ + Γ time after i starts slot
s. Since the difference between two processes' clocks is at most Γ, i starts slot s at most Γ time
after j starts this slot. ■

Lemma 7.2 Consider an execution in which no process fails. If the application at process j
performs mcast_j(m) when current-slot_j = s and if process i delivers m, then i delivers m
immediately after end_slot_i(s+1) occurs.

Proof: If i delivers m, rcv_i(m) is triggered during the members_i(P, s) action. Since no process
fails, suspects[s]_i ∪ suspects[s+1]_i is an empty set, and thus the only precondition that needs
to be satisfied in order to perform members_i(P, s) is s+1 ∈ ended-slots_i, which is true immediately
after end_slot_i(s+1) occurs. ■

As a direct result of these two lemmas, we get the following theorem: 

Theorem 7.3 If the application at process j performs mcast_j(m) at time t, and if process i delivers
m, then i delivers m by time t + Δ_Atom = t + Δ + 2Θ + 2Γ.

7.6.2 CUP bounds 

We now assign values to the constants used in the analysis of CUP in Section 5.5. Recall that δ₁ is an
upper bound on message latency; δ₂ is an upper bound on failure and leave detection time, and if
a message is lost due to failure, then the failure is detected at most δ₂ after the lost message was
sent; and δ₃ is an upper bound on the difference between different processes' initiation times.

Lemma 7.4 δ₁ = Δ

Proof: By definition, both Δ and δ₁ are upper bounds on the underlying network
latency. ■

Lemma 7.5 δ₂ = Δ + 3Θ + 2Γ

Proof: Assume that CUP(k).init_i(*, W) occurs with j ∈ W. Assume that one of the following
happens at time t: fail_j, leave_j, or net_mcast_j(m) for a message m that is lost because j
subsequently fails. Let s be the value of current-slot_j at time t. Assume also that by time
t + Δ + 3Θ + 2Γ, i does not decide, leave, or fail, so CUP-status[k]_i = running and i is active at
this time. We have to show that by this time, fail_detect_i(j) or leave_detect_i(j) occurs.

If j fails at time t, then j's slot s message is never sent, and therefore i detects the failure and
invokes CUP(k).fail_detect_i(j) during end_slot_i(s) at the latest. By Lemma 7.1, this occurs
by time t + 2Θ + Δ + 2Γ. Likewise, if j sends a message m while current-slot_j = s, and m is
lost, then by the FIFO nature of the network, j's slot s message is also lost and i detects j's failure
during end_slot_i(s) at the latest.

Assume next that j leaves when current-slot_j = s, i.e., j's leave-slot is s+1. If i receives
a LEAVE message from j, it receives it before end_slot_i(s+1) occurs, and immediately triggers
CUP(k).leave_detect_i(j). Otherwise, i receives no slot s+1 message from j and suspects j and
invokes CUP(k).fail_detect_i(j) during end_slot_i(s+1). This occurs by time t + 3Θ + Δ + 2Γ. ■


Lemma 7.6 δ₃ = Γ + Θ

Proof: Assume that some process ℓ initiates CUP(k) at time t and does not fail by time
t + Δ. Assume further that CUP(k).init_ℓ(*, W) occurs with j ∈ W. We have to show that j
initiates, abstains, leaves, or fails by time t + Γ + Θ.

Process ℓ triggers CUP(k).init_ℓ(s, *) during the end_slot_ℓ(s) action, and k ∈ suspects[s]_ℓ.
If j initiates CUP(k), there is a slot s' such that j triggers CUP(k).init_j during the end_slot_j(s')
action, and k ∈ suspects[s']_j. By Invariant A.19, s' ≤ s+1. Therefore, CUP(k).init_j occurs
no later than time t + Γ + Θ, and we are done.

Assume now that j does not initiate CUP(k), and does not leave or fail by time t + Γ + Θ. We
now show that j abstains from CUP(k) by time t + Γ + Θ.

When CUP(k).init_ℓ(s, *) is triggered, ℓ multicasts a (CUP-INIT, k) message. By Lemma A.8,
net_join_OK_j occurs before ℓ initiates CUP(k), that is, before ℓ multicasts this message. Moreover,
by assumption, ℓ does not fail by time t + Δ and j does not leave or fail by time t + Δ (because
Δ < Θ). Therefore, j receives this message by time t + Δ, which is before time t + Γ + Θ. After j
receives this message, CUP-status[k]_j is req and CUP-req-val[k]_j is less than or equal to s. By
time t + Γ + Θ, end_slot_j(s+1) occurs and the condition for CUP(k).abstain_j becomes true, and
remains true until CUP(k).abstain_j occurs and changes CUP-status[k]_j. Therefore, before any
time passes, CUP(k).abstain_j occurs. ■

7.6.3 Failure free periods 

We now consider executions in which failures do occur but there are long time periods with no
failures. We analyze the time it takes Atom to clear the backlog it has due to past failures, and
again reach a situation in which message latency is bounded by Δ_Atom, barring additional failures.

Let t₁ = δ₃ + 4δ₂, where δ₃ and δ₂ are bounds as given above for the difference between
process initiation times and failure detection time, resp. From Lemmas 7.6 and 7.5 we get that
t₁ = Γ + Θ + 4(Δ + 3Θ + 2Γ) = 4Δ + 9Γ + 13Θ.
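
To make the arithmetic concrete, the following Python sketch evaluates the bounds stated in
Theorem 7.3 and Lemmas 7.4 to 7.6 for given network latency, slot length, and clock skew. The
function name atom_bounds and the sample parameter values are assumptions chosen for illustration;
only the formulas come from the text.

```python
def atom_bounds(delta: int, theta: int, gamma: int) -> dict:
    """Evaluate the latency and recovery bounds of Section 7.6.

    delta: upper bound on network latency
    theta: slot length
    gamma: upper bound on clock skew between processes
    """
    delta_atom = delta + 2 * theta + 2 * gamma   # failure-free delivery latency
    d1 = delta                                   # CUP message latency bound
    d2 = delta + 3 * theta + 2 * gamma           # failure and leave detection bound
    d3 = gamma + theta                           # spread of CUP initiation times
    t1 = d3 + 4 * d2                             # failure-free time needed to clear the backlog
    return {"delta_atom": delta_atom, "d1": d1, "d2": d2, "d3": d3, "t1": t1}


if __name__ == "__main__":
    bounds = atom_bounds(delta=1, theta=10, gamma=2)
    # The closed form 4*delta + 9*gamma + 13*theta agrees with d3 + 4*d2.
    assert bounds["t1"] == 4 * 1 + 9 * 2 + 13 * 10
    print(bounds)
```

With these sample values, a failure-free period of 152 time units suffices for latency to return to
the failure-free bound of 25 time units.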

Assume that from time t to time t' = t + t₁ there are no failures. We now show that if a message
m is sent after time t', and there are no failures for a period of length Δ_Atom after m is sent, then m
is delivered within Δ_Atom time of when it is sent. Since the delivery order preserves the FIFO order,
this also implies that any message m' sent before time t' is delivered by time t' barring failures in
the Δ_Atom time interval after m' is sent.

Theorem 7.7 Assume no process fails between time t and t' = t + t₁. If mcast_j(m) occurs at a
time t'' such that t + t₁ < t'', and no failures occur from time t'' to time t'' + Δ_Atom, and if i delivers
m, then i delivers m by time t'' + Δ_Atom.

Proof: By Lemma 7.5, by time t + δ₂ all the processes detect all the failures that occur by time t.
Therefore, no process initiates an instance of CUP after time t + δ₂. Since no failures occur after time
t + δ₂, by Theorem 5.9, all CUP instances that i initiates terminate by time t + δ₂ + δ₃ + 3δ₂ = t + t₁.
Let s be the value of current-slot_j at time t'' (i.e., when mcast_j(m) occurs). By Lemma 7.1,
process i performs end_slot_i(s+1) by time t'' + Δ + 2Θ + 2Γ = t'' + Δ_Atom. At this time, there are
no active CUP instances, because CUP instances pertaining to failures that occurred before time t
have all terminated and no new failures occur until time t'' + Δ_Atom. Therefore, for every slot s'
≤ s, in order of slot numbers, members_i(P, s') becomes enabled and remains enabled until it occurs.
So members_i(P, s) occurs before any time passes. If i delivers m, rcv_i(m) is triggered during the
members_i(P, s) action, so rcv_i(m) also occurs before any time passes. ■

7.7 Extending Atom to Cope with Late Messages 

In this paper, we assumed a synchronous model with deterministic network latency guarantees.
Since the network latency Δ is expected to be of a smaller order of magnitude than Θ, making
conservative assumptions in the choice of Δ would not significantly hurt the time bounds.

In ongoing research we are considering networks where latency bounds are more likely to be 
violated. For example, some networks may support differentiated services with probabilistic latency 
guarantees. Moreover, loss rates may exceed the bounds assumed in the implementation of the 
reliable network. Such networks can be represented using the timed-asynchronous [11] failure 
model. 

Although our algorithm cannot guarantee atomic broadcast semantics while network latency 
and reliability guarantees are violated, it is important for the algorithm to be able to recover 
from such situations, and to once more provide correct semantics after network guarantees are re- 
established. In addition, it would be desirable to inform the application when a violation of Atom 
semantics occurs, and when the correct semantics are resumed (following the failure awareness 
approach of [14]). 

There are some strategies that can be used to make Atom recover from periods in which network
guarantees are violated. For example, a lost or late message can cause inaccurate failure suspicions.
With Atom, if a process k is falsely suspected, it will receive a (CUP-INIT, k) message for itself.
To recover from such a situation, we could have the process "commit suicide", that is, inform the
application of the failure and have the application re-join as a new process. The full modification
of Atom for this setting is ongoing work.



8 Conclusions 

We have defined two new problems, Dynamic Atomic Broadcast and Consensus with Unknown 
Participants. We have presented new algorithms for both problems. The latency of both of our 
algorithms depends linearly on the number of failures that occur during a particular execution, but 
does not depend on an upper bound on the potential number of failures, nor on the numbers of 
joins and leaves that happen during the execution.

A Correctness Proofs

A.1 Correctness of the CUP Algorithm

We consider the system consisting of a composition of automata CUP_i, one for each i ∈ I. We
consider a restricted set of executions of this composition: those in which the environment safety
assumptions are all satisfied. The invariants we state throughout this section should be interpreted
as saying that the stated property is true for all states that occur in such executions.

A.1.1 General invariants

We say that a message is in the Net if a net_mcast event for that message has occurred or is in a
trigger-buffer.

The first invariant lists an assortment of basic constraints. They can be proved using induction. 

Invariant A.1
1. value[r,i]_j = ⊥ if and only if world[r,i]_j = ⊥.
2. If value[r,i]_j = v ≠ ⊥ and world[r,i]_j = W then an (i,r,v,W) message is in the Net.
3. If (i,r,*,*) is in the Net then round_i ≥ r.
4. If mode_i = ⊥ then round_i = 0.
5. If mode_i = running then some (i,1,*,*) message is in the Net.
6. If i ∈ failed[r]_j then fail_i has occurred.
7. If i ∈ failed[r]_j and s ≥ r, then value[s,i]_j = ⊥.

The next invariant expresses consistency of values and worlds of the same process at different places 
in the system. 

Invariant A.2
1. If messages (j,r,v,W) and (j,r,v',W') are in the Net then v = v' and W = W'.
2. If value[r,j]_i ≠ ⊥ and value[r,j]_i' ≠ ⊥ then value[r,j]_i = value[r,j]_i'.
3. If world[r,j]_i ≠ ⊥ and world[r,j]_i' ≠ ⊥ then world[r,j]_i = world[r,j]_i'.
4. If value[r,j]_i ≠ ⊥ and world[r,j]_i ≠ ⊥ and a message (j,r,v,W) is in the Net then
value[r,j]_i = v and world[r,j]_i = W.

The next two invariants describe some facts that follow from the existence of OUT messages and 
from the detection of leaves. 

Invariant A.3
1. If an (i,OUT) message is in the Net then mode_i = done.
2. If i ∈ out[r]_j then mode_i = done.

Proof: By induction. Part 2 uses the accurate leave detector assumption. ■

Invariant A.4 If i ∈ out[r]_k and s ≥ r, then no message of the form (i,s,*,*) is in the Net,
and for all j, value[s,i]_j = ⊥.

Proof: By strong induction. First, we claim that a net_mcast_i event cannot convert the invariant
from true to false by falsifying the conclusion while leaving the hypothesis true. This is because,
if the hypothesis is true, then i ∈ out[r]_k in the pre-state of the net_mcast_i, which implies, by
Invariant A.3, that mode_i = done. But the precondition of net_mcast_i requires that mode_i =
running, a contradiction.

The key steps are, therefore, those that make the hypothesis true. Index i can be added to
out[r]_k by receipt of an OUT message by k or by a leave_detect_k(i). An OUT message may
result from a previous abstain_i that occurs when mode_i = ⊥, or a previous decide_i event.

For abstain_i, by Invariant A.1, we know that in the pre-state of the abstain_i, round_i = 0.
Then Invariant A.1 implies that in the pre-state, no message of the form (i,*,*,*) is in the Net,
and for all j and all s, value[s,i]_j = ⊥. Once the abstain_i happens, mode_i becomes done, which
means that no later messages are sent.

For decide_i, the FIFO assumption for message delivery implies that the decide_i event must have
occurred when round_i = r - 1. Invariant A.1 then implies that in the pre-state of the decide_i, the
conclusion of the invariant holds. Since the decide_i event sets mode_i to done, i sends no further
messages, so the conclusion continues to hold.

For leave_detect_k(i), we know by the lossless leave assumption that before the leave_detect_k(i)
occurs, k has already received every message that has ever been net_mcast by i. Since k explicitly
checks that it has no values from i for round r, there are no such messages in the Net. ■

The following says that any value that appears anywhere in the system is some participant's initial 
value. 

Invariant A.5
1. If (i,r,v,W) is in the Net then there exist j and W' such that (j,1,v,W') is in the Net.
2. If value[r,k]_i = v ≠ ⊥ then there exist j and W such that (j,1,v,W) is in the Net.

Proof: We show Parts 1 and 2 together by induction on the length of a finite execution.
Base: Trivial, because no messages are initially in the Net and no values are initially non-⊥.
Inductive step: We first show part 1. The interesting steps are those in which a message (i,r,v,W)
is put into the Net. If r = 1 then (i,r,v,W) is put into the Net by an init_i(v,W) event, which
puts the net_mcast into trigger-buffer_i. But this immediately satisfies the conclusion. On the
other hand, if r ≥ 2, then (i,r,v,W) is put into the Net by an explicit net_mcast(i,r,v,W) step.
In this case, v is obtained from a set of values already in i's value array in the pre-state. The
induction hypothesis, part 2, then implies that some (j,1,v,W') is already in the Net, as needed.
For part 2, the key step is a net_rcv_i(k,r,v,W) for some W. In the pre-state of such a step,
message (k,r,v,W) is in the Net. The inductive hypothesis, part 1, then implies that some (j,1,v,W')
is already in the Net, as needed. ■

The following invariant asserts that processes are always in their own worlds. 

Invariant A.6
1. If a message (i,r,v,W) is in the Net then i ∈ W.
2. If world[r,i]_j ≠ ⊥ then i ∈ world[r,i]_j.

Proof: We prove part 1 by induction on the length of the execution, with a trivial base case.
Inductive step: The interesting steps are those in which a message (i,r,v,W) is put into the Net.
If r = 1, then this is done by an init_i(v,W) step. In this case, the environment well-formedness
assumption implies that no leave_i or fail_i event precedes the init_i, and so the world consistency
assumption implies that i ∈ W, as needed. On the other hand, if r ≥ 2, then (i,r,v,W) is put into
the Net by an explicit net_mcast(i,r,v,W) step. In this case, the precondition says that mode_i =
running and world[1,i]_i ≠ ⊥ in the pre-state. In this pre-state, i is not in any failed[r]_i set,
because if it were, Invariant A.1 would imply that i has failed, and it would not be able to perform
the net_mcast. Also, in this pre-state, i is not in any out[r]_i set, by Invariant A.3. Therefore, i
is included in W, because of the way W is defined.

Part 2 follows from part 1 and Invariant A.1. ■

The following invariant describes consequences of the definition of a round r+1 world and value: 

Invariant A.7 If (i,r+1,v,W) is in the Net, for r ≥ 1, then:
1. For every j ∈ W, world[r,j]_i ≠ ⊥.
2. W = world[r,i]_i \ out[r]_i \ failed[r]_i.
3. v = min { value[r,j]_i : j ∈ W }.
4. W = world[1,i]_i \ out-by[r]_i \ failed-by[r]_i.

Proof: Part 1 is proved by an easy induction; the key step is net_mcast(i,r+1,v,W), and the
conclusion follows immediately from the precondition.

Given part 1, we prove part 2 by induction. Now the interesting steps are net_mcast_i(i,r+1,v,W),
net_rcv_i(j,OUT), leave_detect_i(j), and fail_detect_i(j). The fact that net_mcast_i(i,r+1,v,W)
yields the property follows immediately from the precondition. A net_rcv_i(j,OUT) event could
only falsify the property if j ∈ W and the event puts j into out[r]_i. However, part 1 implies
that world[r,j]_i ≠ ⊥ in the post-state, and hence value[r,j]_i ≠ ⊥ in the post-state. But
this would cause the post-state to violate Invariant A.4, a contradiction. A similar argument
shows that leave_detect_i(j) cannot falsify the property. Finally, fail_detect_i(j) could only
falsify the property if j ∈ W and the event puts j into failed[r]_i. However, part 1 implies that
world[r,j]_i ≠ ⊥ in the post-state, and hence value[r,j]_i ≠ ⊥ in the post-state. But this would
cause the post-state to violate Invariant A.1, a contradiction.

We prove part 3 by induction, using part 1. This time, the interesting steps are
net_mcast_i(i,r+1,v,W) and net_rcv_i(j,r,*,*). Again, the net_mcast_i(i,r+1,v,W) step yields the
property immediately. A net_rcv_i(j,r,v',*) could only falsify the property if j ∈ W. But in this
case we know that value[r,j]_i ≠ ⊥ in the pre-state, and then Invariant A.2 implies that
v' = value[r,j]_i in the pre-state. It follows that this step does not change value[r,j]_i, and so
does not falsify the property.

Part 4 is proved by induction on r (not induction on the length of the execution), using part
2. The base case, r = 1, follows immediately from part 2. For the inductive step, we suppose
that the claim is true for some r ≥ 1 and show it for r + 1. That is, we assume that
(i,r+2,v,W) is in the Net. Then by part 2, W = world[r+1,i]_i \ out[r+1]_i \ failed[r+1]_i.
Now, since world[r+1,i]_i ≠ ⊥, Invariant A.1 implies that a message of the form (i,r+1,v',W')
is in the Net, where W' = world[r+1,i]_i. By inductive hypothesis, part 4, this implies that W'
= world[1,i]_i \ out-by[r]_i \ failed-by[r]_i. Therefore, W = world[1,i]_i \ out-by[r]_i \
failed-by[r]_i \ out[r+1]_i \ failed[r+1]_i. This is equal to world[1,i]_i \ out-by[r+1]_i \
failed-by[r+1]_i, as needed. ■
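
Parts 2 and 3 of Invariant A.7 describe how the contents of a round r+1 message relate to a
process's round r state. The Python sketch below shows that relationship for a single process. It is
illustrative only: the function name next_round_message and the dictionary and set representations
are assumptions, and the actual CUP automaton is defined elsewhere in the paper.

```python
def next_round_message(world_r: set, out_r: set, failed_r: set, values_r: dict):
    """Compute the (value, world) pair carried by a round r+1 message.

    world_r:  the sender's own round r world
    out_r:    processes known to have left, abstained, or decided in round r
    failed_r: processes detected as failed in round r
    values_r: round r values received so far, keyed by process id
    """
    next_world = set(world_r) - set(out_r) - set(failed_r)
    if not next_world:
        raise ValueError("next-round world is empty")
    missing = sorted(j for j in next_world if values_r.get(j) is None)
    if missing:
        # The message cannot be sent until every remaining process has reported.
        raise ValueError(f"no round-r value recorded for {missing}")
    next_value = min(values_r[j] for j in next_world)
    return next_value, next_world
```

Because the next world is obtained by removing processes and the next value is a minimum over the
remaining ones, neither can grow from one round to the next, which is the property the later
invariants rely on.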

Invariant A.8 Suppose that decide_i(v) has happened at round r. Then:
1. For all j ∈ world[r,i]_i \ out[r]_i, value[r,j]_i = value[r,i]_i and world[r,j]_i ⊆ world[r,i]_i.
2. For all j ∈ world[r,i]_i, if (j,r,v',W) is in the Net then v' = v and W ⊆ world[r,i]_i.

Proof: Part 1 follows from an easy induction: out can only grow, and value and world do not
change once they are non-⊥. Therefore, the only interesting step is decide_i(v), and the result
follows directly from the precondition.

For part 2, consider any state s after a decide_i(v) has happened at round r. Suppose that
j ∈ world[r,i]_i and (j,r,v',W) is in the Net, in state s. Then Invariant A.4 implies that j ∉
out[r]_i. Thus, j ∈ world[r,i]_i \ out[r]_i. Then part 1 and Invariant A.2 imply the conclusion. ■

The following invariants say that any process' values and worlds decrease as rounds increase. 

Invariant A.9 For any r ≥ 1, if a message (i,r+1,v,W) is in the Net then value[r,i]_i ≠ ⊥, v
≤ value[r,i]_i, and W ⊆ world[r,i]_i.

Proof: By induction. For the inductive step, the interesting case is when the last action of the
execution is net_mcast(i,r+1,v,W). Invariant A.6 implies that i ∈ W. Therefore, the precondition
for net_mcast(i,r+1,v,W) implies that, in the pre-state, value[r,i]_i ≠ ⊥. Therefore, this is also
true in the post-state, as needed.

Next, we show that v ≤ value[r,i]_i. The value v is determined in the net_mcast event to be
the minimum of the set of values of the form value[r,j]_i, for j ∈ W. Since i ∈ W, this minimum
includes value[r,i]_i. Therefore, v ≤ value[r,i]_i.

Finally, we show that W ⊆ world[r,i]_i. The value W is determined in the net_mcast event to
be world[r,i]_i \ out-by[r] \ failed-by[r], according to the values of the out and failed sets
in the pre-state. It follows immediately that W is a subset of world[r,i]_i. ■

Invariant A.10 For any r ≥ 1,
1. If value[r+1,i]_i ≠ ⊥ then value[r,i]_i ≠ ⊥, value[r+1,i]_i ≤ value[r,i]_i, and
world[r+1,i]_i ⊆ world[r,i]_i.
2. If value[r,i]_i ≠ ⊥ and 1 ≤ s ≤ r then value[s,i]_i ≠ ⊥, value[r,i]_i ≤ value[s,i]_i,
and world[r,i]_i ⊆ world[s,i]_i.

Proof: For part 1, assume that, in some reachable state, value[r+1,i]_i ≠ ⊥, and hence
world[r+1,i]_i ≠ ⊥. Then Invariant A.1 implies that in the same state, a message (i,r+1,v,W) must
be in the Net, where v = value[r+1,i]_i and W = world[r+1,i]_i. Invariant A.9 then yields the
conclusions. Part 2 follows from part 1, using induction on r - s. ■

The following invariant says that, if all the messages for a particular round r are "consistent", then 
so are all the messages for all later rounds. 

Invariant A.11 Let W be a nonempty finite set, v ∈ V, and r ≥ 1. Suppose that, for every i ∈ W,
if a message of the form (i,r,v',W') is in the Net, then W' ⊆ W and v' = v.
Then for every i, j ∈ W and for every s ≥ r,
1. If a message of the form (i,s,v',W') is in the Net, then W' ⊆ W and v' = v.
2. value[s,i]_j is either v or ⊥.
3. world[s,i]_j is either a subset of W or ⊥.

Proof: We prove part 1 by induction on the length of an execution.
Base: The conclusion of the invariant is vacuously true in the start state.

Inductive step: The interesting steps are those that put some message (i,s,v',W') into the Net,
where i ∈ W and s ≥ r. We may restrict attention to the case where s > r, because if s = r and
the step falsifies part 1, it also falsifies the hypothesis of the invariant. Thus, the only interesting
steps are of the form net_mcast(i,s,v',W') where i ∈ W and s > r. So consider such a step, and
fix i, s, v', and W'. Assume that the hypothesis of the invariant is true after (and hence before)
the step.

We show that W' ⊆ W. After the net_mcast step, the message is in the Net. Invariant A.9
then implies that W' ⊆ world[s-1,i]_i. Invariant A.10 then implies that world[r,i]_i ≠ ⊥
and world[s-1,i]_i ⊆ world[r,i]_i. Therefore, W' ⊆ world[r,i]_i. Since world[r,i]_i ≠ ⊥,
an (i,r,*,W'') message is in the Net, where W'' = world[r,i]_i. Then since the hypothesis of
the invariant is true, it follows that world[r,i]_i ⊆ W. Putting all the pieces together yields that
W' ⊆ W.

Next, we argue that v' = v. The value v' is determined by the precondition of the net_mcast
action, as the minimum of a set of values value[s-1,j]_i, taken over all indices j in W'. Because W'
⊆ W, every such index j is in W. Since value[s-1,j]_i ≠ ⊥, a (j,s-1,v'',W'') message is in the
Net in the pre-state of the new net_mcast, where v'' = value[s-1,j]_i. Our assumption that the
conclusion of the invariant is true in the pre-state then implies that value[s-1,j]_i = v. Thus, all
the values considered in the min are equal to v, which implies that v' = v.

This proves part 1. Parts 2 and 3 follow from part 1 and Invariant A.1. ■

Invariant A.12 If round_i = r > 1 and mode_j = running and j is not failed, then round_j ≥ r
- 1.

Proof: Since j is not failed, by Invariant A.1(6), j ∉ failed[s]_i for any s, so j ∉ failed-by[r-1]_i.
By Invariant A.3(2), j ∉ out[s]_i for any s, so j ∉ out-by[r-1]_i. Since mode_j = running, j
initiated, and by the world consistency assumption, j ∈ world[1,i]_i. By Invariant A.7(4), j is in i's
world for round r. Therefore, i must have received a round r-1 message from j before moving to
round r. ■

A.1.2 CUP safety guarantees

We now prove that the CUP implementation satisfies the CUP safety guarantees, assuming the 
environment satisfies the safety assumptions. 

Theorem A.l The CUP algorithm satisfies well-formedness. 

Proof: This is straightforward from the code and the well-formedness assumptions on the en- 
vironment. For condition 1, assume that decide, occurs. Then in the preceding state, mode = 
running, mode is initially _L, and the only way it becomes running is via initj. So there must be 
a preceding initj. 

For condition 2, assume for the sake of contradiction that two decidej events occur. Part 1 
implies that an initj precedes the first decidej. The first decidej sets modej to done. After this 
point, and before the second decidej event occurs, modej must become running. This can happen 
only as a result of another initj event. This means that two initj events must occur, which 
contradicts the environment well-formedness assumption. Therefore, no more than one decidej 
event occurs. ■ 
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Theorem A. 2 The CUP algorithm satisfies uniform agreement. 

Proof: If at most one decide event occurs, the result follows immediately. So assume that
there are at least two decide events. Consider the first decide event, decide_i(v). By the
precondition, we know that in the pre-state, there exists r such that world[r,i]_i ≠ ⊥ and, for all
j ∈ world[r,i]_i \ out[r]_i, value[r,j]_i = v and world[r,j]_i ⊆ world[r,i]_i. Since i does not
leave, abstain, or decide before the decide_i event, we know that i ∉ out[r]_i in the pre-state;
therefore, value[r,i]_i = v. Also consider any particular later decide event, decide_{i'}(v'). As above,
we know that in the pre-state of this event, there exists r' such that world[r',i']_{i'} ≠ ⊥ and,
for all j ∈ world[r',i']_{i'} \ out[r']_{i'}, value[r',j]_{i'} = v' and world[r',j]_{i'} ⊆ world[r',i']_{i'}.
Moreover, value[r',i']_{i'} = v'.
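The decide precondition quoted in this proof can be checked mechanically. The following is a minimal sketch, assuming the per-process state is held in plain dictionaries; the function name `decide_value`, the dictionary layout, and the use of `None` for ⊥ are illustrative assumptions, not the paper's automaton code.

```python
BOTTOM = None  # stands in for the undefined value ⊥ (assumption of this sketch)

def decide_value(i, r, world, value, out):
    """Return v if the quoted decide precondition holds for process i at round r.

    world and value map (round, process) to a set of processes and a value;
    out maps a round to the set of processes known to be out at that round.
    Returns None when the precondition does not hold.
    """
    w_i = world.get((r, i), BOTTOM)
    if w_i is BOTTOM:
        return None
    peers = w_i - out.get(r, set())
    if not peers:
        # In the proof, i itself is always in this set; an empty set is
        # treated here as "not enabled" (assumption of this sketch).
        return None
    seen = set()
    for j in peers:
        w_j = world.get((r, j), BOTTOM)
        v_j = value.get((r, j), BOTTOM)
        # Every remaining peer must report a value and a world contained in i's.
        if w_j is BOTTOM or v_j is BOTTOM or not (w_j <= w_i):
            return None
        seen.add(v_j)
    return seen.pop() if len(seen) == 1 else None


example_world = {(1, "a"): {"a", "b", "c"}, (1, "b"): {"a", "b"}, (1, "c"): {"a", "b", "c"}}
example_value = {(1, "a"): 5, (1, "b"): 5, (1, "c"): 7}
```

With `c` in `out[1]`, process `a` sees only the agreeing peers `a` and `b` and may decide 5; with `c` still present, the disagreeing value 7 blocks the decision.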

We now show that i' ∈ world[r,i]_i. Since i' decides, it initiates and does not leave or fail before
it decides. Since i initiates before it decides, and thus before i' decides, i' does not leave or fail
before i initiates. Then the world consistency assumption implies that i' gets put into world[1,i]_i.
If r = 1 then we are done, so assume that r ≥ 2. Then the value of world[r,i]_i is determined in a
net_mcast_i(i,r,*,*) step. To see that i' is included in world[r,i]_i, note that this set is defined
in the net_mcast_i(i,r,*,*) step to include (at least) all processes in world[1,i]_i that do not
leave, abstain, decide, or fail before the net_mcast_i(i,r,*,*) event. And i' does not leave, abstain,
decide, or fail by then, because this net_mcast_i(i,r,*,*) event happens prior to the decide_i.

We also know that i' ∉ out[r]_i in the pre-state of decide_i. This is because i' has not left,
abstained, or decided before the decide_i.

Next, we show that r' ≥ r, that is, the round at which i' decides is at least as great as the
round at which i decides. Since i' ∈ world[r,i]_i and i' ∉ out[r]_i in the pre-state of decide_i, the
precondition for decide_i implies that value[r,i']_i ≠ ⊥ in the pre-state of decide_i. This means
that i' must send an (i',r,v,*) message. This implies that the round r' at which i' decides is at
least as great as r, that is, r' ≥ r.

Finally, we argue that v' = v. Invariant A.8, part 2, implies that in the pre-state of decide_i,
if j ∈ world[r,i]_i and if (j,r,v'',W'') is in the Net, then v'' = value[r,i]_i and W'' ⊆
world[r,i]_i. Since r' ≥ r and i' ∈ world[r,i]_i, Invariant A.11, part 2, implies that in the
pre-state of decide_{i'}, value[r',i']_{i'} is either v or ⊥. Since (as noted earlier) value[r',i']_{i'} =
v', we have that v = v'. ■



Theorem A.3 The CUP algorithm satisfies validity.

Proof: Part 1 follows from Invariant A.5. Part 2 follows from Invariant A.10. ■

A.1.3 CUP liveness guarantees

We now show that CUP satisfies its liveness property, termination. Formally, the lemmas and
theorem we state in this section should be interpreted with respect to an execution α of the
composition of automata CUP_i for i ∈ I such that:

1. All the environment safety and liveness assumptions are satisfied in α.

2. α is "weakly fair" to all actions of all CUP_i automata, in the sense that if an action is enabled
from some point onward, it eventually is performed.




Lemma A.4 Let J be the set of processes that initiate and never decide, leave, or fail, and suppose
that i ∈ J. If init_i(v, W) occurs and j ∈ W then either j ∈ J or else j abstains, leaves, decides, or
fails.

Proof: Follows from the init occurrence assumption. ■ 

Lemma A.5 If process i initiates and never decides, leaves, or fails, then round_i increases without
bound.

Proof: Let J be the set of all processes that initiate and never decide, leave, or fail. Assume for
the sake of contradiction that, for some process i ∈ J, round_i is bounded. Let r be the smallest
round number such that for some process i ∈ J, round_i is bounded by r, and fix such i ∈ J. Process
i cannot get stuck at round 0, because the init_i action immediately increments the round to 1. So
we may assume that r > 0.

We argue that i cannot be stuck at round r, by showing that for some v, W, the
net_mcast_i(i,r+1,v,W) action is eventually enabled and stays enabled. Then weak fairness implies
that net_mcast_i(i,r+1,v,W) eventually occurs.

We claim that the last precondition of net_mcast_i(i,r+1,*,*) (the negation of the decide
precondition) is always true. For if not, then decide_i(v) would be enabled for some v, and would
stay enabled forever. This implies, by weak fairness, that decide_i occurs, a contradiction.

Next, we claim that for every j ∈ world[1,i]_i, either i receives a round r message from j, or
else i puts j into its failed[r'] set or out[r'] set for some r' ≤ r. Fix any such j. Lemma A.4
implies that either j ∈ J or j eventually abstains, leaves, decides, or fails. If j ∈ J then by choice
of r, j does not get stuck at any round less than r, and so j eventually sends a round r message,
which i eventually receives.

If j fails, then eventually a fail_detect_i(j) occurs, which makes i put j into one of its
failed[r'] sets. If r' ≤ r then we are done; on the other hand, if r' > r then i receives a
round r message from j.

If j abstains and does not fail, then eventually i puts j into its out[1] set (which suffices
because 1 ≤ r). If j leaves or decides at a round r' ≤ r, then eventually i puts j into its out[r']
set. Finally, if j leaves or decides at a round r' > r, then eventually i receives a round r message
from j.

This claim implies that eventually the precondition of net_mcast_i(i,r+1,v,W) is satisfied for
some v, W. Because the values and worlds can only decrease, eventually the precondition is satisfied,
and remains satisfied, for the same v, W. Then weak fairness implies that the action eventually
occurs, which moves i to round r + 1. This is a contradiction. ■

Lemma A.6 Let J be the set of processes that initiate and never decide, leave, or fail, and suppose
that i ∈ J. Then for r sufficiently large, world[r,i]_i = J.

Proof: The result follows from two claims: that for all r, J ⊆ world[r,i]_i, and that for
sufficiently large r, world[r,i]_i is a subset of J.

First, we show that for all r, J ⊆ world[r,i]_i. World consistency implies that J ⊆ world[1,i]_i.
Since no element of J ever abstains, leaves, fails, or decides, no element of J is ever put into any
failed[r]_i or out[r]_i. Then the definition of world[r,i]_i (in net_mcast_i(i,r,*,*)) implies that
for all r, J ⊆ world[r,i]_i.

Second, we show that for sufficiently large r, world[r,i]_i is a subset of J. Let j be any element
of world[r,i]_i. Lemma A.4 implies that if j ∉ J, then j eventually abstains, leaves, decides, or
fails. But in any of these cases, j eventually gets put into some failed[r]_i or out[r]_i. This means
that j is excluded from world[r,i]_i for sufficiently large r. ■

Theorem A.7 The CUP algorithm satisfies termination.

Proof: We prove that every process that initiates eventually decides, leaves, or fails. Assume for
the sake of contradiction that there is at least one initiator that does not decide, leave, or fail. Let
J be the set of processes that initiate and never decide, leave, or fail; then J is not empty. Then
Lemma A.5 implies that the rounds of all processes in J increase without bound, and Lemma A.6
implies that for sufficiently large r, world[r,i]_i = J for all i ∈ J. Thus from some round onward,
every process in J bases its new value on values heard from exactly the members of J.

Thereafter, each i ∈ J eventually reaches some minimum value of value[r,i]_i (by monotonicity
and the fact that only finitely many values can be used). Consider a round beyond which all the
minima have been attained. If these are all identical, then all processes can decide based on this
value and world J, and we are done. On the other hand, if they are not all identical, then let i be a
process whose minimum is larger than some other process' minimum. Then i would see a smaller
value and reduce its value further, a contradiction. ■
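The convergence step of this argument can be illustrated with a toy simulation. This is only a sketch under a simplifying assumption: once the world of every surviving process equals J, each process hears all values held in J every round and adopts the smallest one. The function name `run_min_rounds` and the dictionary representation are hypothetical, and the real CUP update rule is defined in the algorithm's code rather than here.

```python
def run_min_rounds(values, max_rounds=10):
    """Simulate rounds in which every process in J adopts the minimum value heard.

    values maps each process in J to its current value. Returns the list of
    per-round snapshots, ending once a round changes nothing.
    """
    history = [dict(values)]
    for _ in range(max_rounds):
        smallest = min(values.values())
        # Values only decrease: each process keeps the smaller of its own
        # value and the smallest value heard from J.
        new_values = {p: min(v, smallest) for p, v in values.items()}
        history.append(new_values)
        if new_values == values:
            break
        values = new_values
    return history


history = run_min_rounds({"a": 4, "b": 2, "c": 9})
```

In this toy run all three processes hold the value 2 after one round, which is the situation in which the proof says every process can decide based on the common value and world J.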

A.2 Atom Correctness Proof: Safety Arguments

A.2.1 General Invariants

The following invariants follow immediately from the code:

Invariant A.13 If join-slot_i ≠ ∞ then leave-slot_i > join-slot_i.

Invariant A.14 Suppose s ∈ ended-slots_i. Then:
1. If j ∈ joiners[s]_i then join-slot_j = s.
2. If j ∈ leavers[s]_i then leave-slot_j = s.

Proof: Process j can be inserted into joiners[s]_i (leavers[s]_i) only if i receives a (j, JOIN,
s-1) (resp. (j, LEAVE, s)) message, which can be sent only by j and only if join-slot_j = s
(resp. leave-slot_j = s). ■

The following invariant asserts that from the join slot onward, slot messages (bulk, join, or 
leave) are multicast in order. 

Invariant A.15 If join-slot_i − 1 ≤ s' ≤ s and s ∈ mcast-slots_i then s' ∈ mcast-slots_i.



Proof: join-slot_i had to have been set before current-slot_i becomes s'+1 because it is always
chosen to be in the future. Therefore, net_mcast_i(i, *, s') is enabled once current-slot_i
becomes s'+1. This is earlier than the time at which net_mcast_i(i, *, s) can occur, so time
could not have passed beyond that point without net_mcast_i(i, *, s-1) occurring. ■
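Invariant A.15 says that the slots in which a process has multicast form a gap-free run starting at join-slot − 1. A minimal sketch of a checker for that property follows; the function name `satisfies_a15` and the representation of mcast-slots as a Python set are assumptions made for illustration only.

```python
def satisfies_a15(join_slot, mcast_slots):
    """Check the gap-free property stated by Invariant A.15 for one process.

    For every s in mcast_slots with s >= join_slot - 1, every s' with
    join_slot - 1 <= s' <= s must also be in mcast_slots.
    """
    relevant = [s for s in mcast_slots if s >= join_slot - 1]
    if not relevant:
        return True  # nothing multicast yet, so the invariant holds vacuously
    return all(sp in mcast_slots for sp in range(join_slot - 1, max(relevant) + 1))
```

For example, a process with join-slot 5 that multicast in slots 4, 5, 6, 7 satisfies the check, while one that skipped slot 6 does not.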

The following invariant is central to the rest of the proof. It asserts that by the time of
end_slot_i(s), i has all the right processes in alive[s-1], alive[s], joiners[s], and joiners[s+1].

Invariant A.16 If s ∈ ended-slots_i then

1. If join-slot_j ≤ s and s ∈ mcast-slots_j then j ∈ alive[s-1]_i ∪ joiners[s]_i.

2. If join-slot_j ≤ s+1 and s+1 ∈ mcast-slots_j then j ∈ alive[s]_i ∪ joiners[s+1]_i.



Proof: If j joined by slot s, it registered for the network before starting slot s-1. Moreover,
if s ∈ mcast-slots_j, then by Invariant A.15, s-1 ∈ mcast-slots_j, and therefore j multicasts
either a (j, JOIN, s-1) or a bulk message in slot s-1. By the Net's reliable delivery property,
this message is not lost due to j's failure: j multicasts a message in the following slot,
which occurs Θ time later, we assume that Θ > Δ, and messages sent more than Δ time
before the failure are not lost. Likewise, if join-slot_j ≤ s+1 and s+1 ∈ mcast-slots_j, then s
∈ mcast-slots_j (by Invariant A.15), and j multicasts a bulk or join message in slot s, which is
not lost due to j's failure.

We will now show two things: first, that i joined early enough to get j's slot s-1 bulk or join
message; and second, that end_slot_i(s) occurred late enough for i to have received j's slot s bulk
or join message.

Since i does end_slot for s, join-slot_i ≤ s. Process i chooses its join-slot following the
net_join_OK_i to be current-slot_i + 2 + ⌈Γ/Θ⌉, so current-slot_i becomes s-1 at least Γ time
after the net_join_OK_i. Since the maximum clock difference between i and j is Γ, j sends its
message (join or bulk) for slot s-1 no earlier than the time of the net_join_OK_i, so i joined early
enough to get j's message for slot s-1.

It is left to show that i gets j's bulk or join message for slot s before end_slot_i(s). This
follows from the precondition for end_slot_i, which asserts that clock_i > (s + 1)Θ + Δ + Γ. That
is, at least Δ + Γ time has elapsed since slot s+1 began at i. Since the clock difference
between i and j is at most Γ, at least Δ time has elapsed since slot s+1 began at
j. Since j sends its slot s message once slot s+1 begins at j, and the network latency is bounded
by Δ, the message reaches i before end_slot_i(s). ■
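The timing arithmetic in this proof can be traced with concrete numbers. The sketch below assumes integer time units and uses hypothetical helper names (`choose_join_slot`, `end_slot_time_ok`, `elapsed_at_peer`); it only restates the two formulas used above, the join-slot choice current-slot + 2 + ⌈Γ/Θ⌉ and the end_slot clock precondition, and is not the Atom code itself.

```python
import math

def choose_join_slot(current_slot, gamma, theta):
    """Join slot chosen after net_join_OK: current-slot + 2 + ceil(gamma / theta)."""
    return current_slot + 2 + math.ceil(gamma / theta)

def end_slot_time_ok(clock, s, theta, delta, gamma):
    """Clock part of the end_slot(s) precondition: clock > (s+1)*theta + delta + gamma."""
    return clock > (s + 1) * theta + delta + gamma

def elapsed_at_peer(clock_i, s, theta, gamma):
    """Lower bound on time elapsed at a peer since its slot s+1 began.

    The peer's clock may lag i's clock by at most gamma.
    """
    return clock_i - gamma - (s + 1) * theta


# Illustrative parameters (assumptions): slot length 10, latency 3, skew 2.
THETA, DELTA, GAMMA = 10, 3, 2
```

With these values, a process at slot 7 picks join slot 10, and once its clock reads 46 it may end slot 3; at that point at least 4 > Δ time units have passed at any peer since slot 4 began there, so the peer's slot 3 message has arrived.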

The following invariants are related to the suspects [s] sets. 

Invariant A.17 If suspects[s]_i is not empty, then join-slot_i ≤ s.

Proof: suspects[s]_i gets set only upon end_slot_i(s), for which this is a precondition. Once
join-slot_i is set to a non-∞ value, it does not change, by the singularity of join and net_join_OK. ■



Invariant A.18 If j ∈ suspects[s]_i then j has failed.

Proof: Since j gets inserted to suspects[s]_i during end_slot_i(s), j is in ((alive[s-1]_i ∪
joiners[s]_i) \ leavers[s]_i) \ alive[s]_i. In particular, j is in alive[s-1]_i ∪ joiners[s]_i,
and therefore join-slot_j ≤ s. Moreover, j is not in alive[s]_i, so by the contrapositive of
Invariant A.16(2), s+1 ∉ mcast-slots_j, which implies that j either fails or leaves before sending a
slot s+1 message. Since j is also not in leavers[s]_i, j must have failed. ■
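The set expression used in this proof translates directly into set operations. The following is a small sketch, with the hypothetical function name `compute_suspects` and Python sets standing in for the alive, joiners, and leavers state variables of process i.

```python
def compute_suspects(alive_prev, joiners, leavers, alive_cur):
    """suspects[s] = ((alive[s-1] | joiners[s]) - leavers[s]) - alive[s].

    A process is suspected in slot s if it was expected (alive in s-1 or
    joining at s), did not announce a leave, and was not heard from in s.
    """
    return ((alive_prev | joiners) - leavers) - alive_cur


suspected = compute_suspects(
    alive_prev={"a", "b", "c"},
    joiners={"d"},
    leavers={"c"},
    alive_cur={"a", "d"},
)
```

Here `b` is the only suspect: `c` left voluntarily, and `a` and `d` were heard from in the current slot.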

Invariant A.19 If j ∈ suspects[s]_i and j ∈ suspects[s']_{i'} then |s' − s| ≤ 1.

Proof: Without loss of generality, assume s' ≥ s. Since j ∈ suspects[s]_i, j ∈ alive[s-1]_i ∪
joiners[s]_i, and therefore join-slot_j ≤ s. Moreover, j is not in alive[s]_i when s ∈ ended-slots_i,
so by the contrapositive of Invariant A.16(2), s+1 ∉ mcast-slots_j. By Invariant A.15, for any
slot r > s, r ∉ mcast-slots_j, and therefore j ∉ alive[r]_{i'} for any r > s. Since i' suspects j in
slot s', j is in alive[s'-1]_{i'}, and therefore s'-1 ≤ s. ■

The following invariant states that a process does not abstain from CUP instances pertaining 
to processes that it suspects. 

Invariant A.20 If k ∈ suspects[s]_i and CUP-status[k]_i = done then CUP-dec-val[k]_i ≠ ⊥.



Proof: Assume by contradiction that the invariant is false. Since CUP-status[k]_i = done while
CUP-dec-val[k]_i = ⊥, i must have performed CUP(k).abstain_i. By the precondition for
CUP(k).abstain_i, CUP-status[k]_i was req when abstain_i occurred, which implies that end_slot_i(s)
could not have already occurred, that is, all the slots in ended-slots_i were smaller than s
at the time of CUP(k).abstain_i. By the precondition for CUP(k).abstain_i, when it occurred,
CUP-req-val[k]_i had some non-⊥ value, v, such that v < s-1. This, in turn, implies that a
(CUP-INIT, k) message with slot v had previously arrived. That means that such a message was
previously sent by some j, which implies that k is added to suspects[v]_j during end_slot_j(v),
and remains there henceforward. But k ∈ suspects[s]_i and v < s-1, a contradiction to
Invariant A.19. ■

Invariant A.21 If k ∈ alive[s]_i, k ∉ alive[s]_j, and s ∈ ended-slots_j then k ∈ suspects[s]_j.
Moreover, if s+1 ∈ ended-slots_i then k ∈ suspects[s+1]_i.

Proof: Since k ∈ alive[s]_i, we know that join-slot_k ≤ s < leave-slot_k and that s ∈
mcast-slots_k. Therefore, by Invariant A.16(1), if s ∈ ended-slots_j then k ∈ alive[s-1]_j ∪
joiners[s]_j. Additionally, k is neither in leavers[s]_i nor in leavers[s]_j, because it does
not leave at slot s. Therefore, since k ∉ alive[s]_j, in end_slot_j(s), k gets inserted into
suspects[s]_j.

Since k ∉ alive[s]_j, by the contrapositive of Invariant A.16(2), we get that s+1 ∉ mcast-slots_k.
That is, k does not send a bulk or leave message for slot s+1. Therefore, k ∉ alive[s+1]_i ∪
leavers[s+1]_i when end_slot_i(s+1) occurs, and k gets inserted into suspects[s+1]_i when s+1
is inserted to ended-slots_i. ■

Invariant A.22 If k ∈ alive[s]_i and s+1 ∈ ended-slots_i and CUP-dec-val[k]_j ≤ s then k ∈
suspects[s+1]_i.

Proof: Since k ∈ alive[s]_i, join-slot_k ≤ s < leave-slot_k. Since CUP-dec-val[k]_j ≤ s,
by the validity property of CUP, some process l must have initiated CUP(k) with an
initial value s' ≤ s. This implies that k ∈ suspects[s']_l, and therefore k ∉ alive[s']_l and s'
∈ ended-slots_l. By the contrapositive of Invariant A.16, s'+1 ∉ mcast-slots_k, and therefore
(by Invariant A.15) also s+1 ∉ mcast-slots_k. So i does not hear a bulk or leave message from k
for slot s+1, and k ∈ suspects[s+1]_i. ■

Lemma A.8 Assume that for some processes j, k, l, CUP(k).init_l(v, W) occurs with j ∈ W, and
that CUP(k).init_i(v', W') also occurs. Then net_join_OK_j has occurred before CUP(k).init_i(v',
W').

Proof: By the precondition for CUP(k).init, k ∈ suspects[v']_i and k ∈ suspects[v]_l, so by
Invariant A.19, v' ≥ v − 1. When CUP(k).init_l(v, W) occurs, W = alive[v]_l ∪ joiners[v+1]_l.
Since j ∈ W, this implies that join-slot_j ≤ v+1 ≤ v'+2. Assume CUP(k).init_i(v', W') occurs
at time t. So at time t, v' ∈ ended-slots_i. By the precondition for end_slot_i(v'), at time t
clock_i > (v' + 1)Θ + Δ + Γ. Since v'+1 ≥ join-slot_j − 1, at this time clock_i > (join-slot_j −
1)Θ + Δ + Γ. Since the clock skew is bounded by Γ, at time t, clock_j > (join-slot_j − 1)Θ + Δ.
So t is at least Δ time after j begins slot join-slot_j − 1. But join-slot_j is chosen to be at
least 2 slots after the slot at which net_join_OK_j occurs at j, so j begins slot join-slot_j − 1
after the net_join_OK_j; that is, net_join_OK_j occurs before time t. ■

A.2.2 Safety environment conditions for CUP

Well-formedness: CUP(k).init_i only occurs when k becomes suspected at i. Once k is
suspected, it is never again alive. Therefore, it is never suspected again and CUP(k).init_i occurs at
most once. By Invariant A.20, since k is suspected at i, i does not abstain. Thus, at most one
init_i or abstain_i event occurs.

The fact that at most one leave_i event occurs and at most one fail_i event occurs is ensured
by the application, since leave and fail actions are routed directly from the application to all
instances of CUP.

The fact that no fail_i precedes an init_i follows from the fact that failures affect all components
and processes do not take any steps after they fail.

One of the preconditions of end_slot_i is that leave-slot_i = ∞, that is, that leave_i did not
occur. Therefore, no leave_i precedes an init_i.

World consistency: Assume that CUP(k).init_i(s, W) occurs at time t, j does not leave or fail
before time t, and CUP(k).init_j(s', *) also occurs. We need to show that j ∈ W.

CUP(k).init_i(s, W) is triggered during end_slot_i(s). We need to show that at this time j ∈
alive[s]_i ∪ joiners[s+1]_i. This is true if i receives j's slot s bulk or join message.

By the precondition for end_slot_i(s), clock_i > (s+1)Θ + Δ + Γ at time t. Since the difference
between a process clock and real time is at most Γ/2, the real time associated with point t is at
least (s+1)Θ + Δ + Γ/2. By assumption, j does not fail or leave until this time.

By Invariant A.19, s' ≤ s+1. When CUP(k).init_j(s', *) occurs, k ∈ suspects[s']_j. By
Invariant A.17, join-slot_j ≤ s'. Together these two inequalities imply that join-slot_j ≤ s+1.
Therefore, if j does not fail or leave before clock_j becomes (s+1)Θ, j multicasts its slot s bulk or join
message when clock_j = (s+1)Θ (a join message is multicast if join-slot_j = s+1; otherwise j
multicasts a slot s bulk message). When clock_j = (s+1)Θ, the real time is at most (s+1)Θ + Γ/2.
If j does not fail until the real time becomes (s+1)Θ + Γ/2 + Δ, then j's message is not lost, and i
receives it by time (s+1)Θ + Γ/2 + Δ. But we assume that j does not fail or leave until this time.

Accurate failure detector: CUP(k).fail_detect_i(j) occurs only if for some slot s, j ∈ suspects[s]_i.
Therefore, by Invariant A.18, j has previously failed. Moreover, since a process is never again alive
after it is suspected, it is never again suspected, and CUP(k).fail_detect_i(j) does not recur.

Accurate leave detector: CUP(k).leave_detect_i(j) occurs only if a LEAVE message is
received from j; j sends at most one such message and only if it actually leaves.

Lossless leave: Assume a CUP process at j multicasts a message m, and subsequently, leave_j
occurs. When leave_j occurs, a LEAVE message is inserted to out-buf_j to be sent in the ensuing
slot. This LEAVE message is multicast after m. leave_detect_i(j) occurs when this LEAVE
message is received. By the FIFO property of Net, net_rcv_i(m) occurs beforehand.

A.2.3 Proving the total order property

We now prove that all the processes deliver messages in a consistent total order. We define the total
order S as follows: Let P_s be the union of all sets P such that an action members(P, s)_i occurs.
The set of messages S_s is defined to be those messages included in slot s bulk messages by processes
in P_s. The set of messages in S is defined to be the union of all sets S_s.

The ordering is based on slots, so that for s < s', all messages in S_s precede all messages in
S_{s'}. For messages pertaining to the same set S_s, the ordering is by process indices. For the same
slot and process index, the ordering is the temporal order of sending (at the external boundary of
Atom).
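The three-level ordering just defined (slot, then process index, then sending order) is a lexicographic sort. The sketch below assumes a hypothetical record type `Delivered` with a per-sender sequence number `seq` standing in for the temporal order of sending; none of these names come from the paper.

```python
from typing import NamedTuple

class Delivered(NamedTuple):
    slot: int      # slot s of the bulk message that carried this message
    sender: int    # process index of the sender
    seq: int       # position in the sender's sending order within the slot
    payload: str

def total_order(messages):
    """Sort messages by slot, then sender index, then sending order."""
    return sorted(messages, key=lambda m: (m.slot, m.sender, m.seq))


msgs = [
    Delivered(2, 1, 0, "d"),
    Delivered(1, 2, 0, "b"),
    Delivered(1, 1, 1, "a2"),
    Delivered(1, 1, 0, "a1"),
]
ordered_payloads = [m.payload for m in total_order(msgs)]
```

Both messages from process 1 in slot 1 come first in their sending order, then the slot 1 message from process 2, then the slot 2 message.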

We have to show that every process delivers a contiguous subsequence of S. We first prove
Lemma A.9, asserting that every two processes that perform a members(P, s) action for a slot
s do so with the same membership set P. As part of this action, processes deliver messages for
slot s. Next, we prove Lemma A.10, asserting that if a process i performs members_i(P, s) with
j ∈ P, then i has received a bulk message for slot s from j, and therefore triggers the delivery of
all the messages included in it as an effect of the members_i(P, s) action. Thus, every process
that performs members(P, s) triggers the delivery of all the messages in S_s. These messages are
delivered in order of the sender's process index, and for each process, in FIFO order. Therefore,
these messages are delivered in the order defined on S_s.

Since every process performs members(P, s) for a contiguous subsequence of slots, every process
delivers a contiguous subsequence of the messages in S.

We now prove the lemmas: 

Lemma A.9 If members(P,s)_i and members(P',s)_j occur, then P = P'.

Proof: Let k be a process in P. At the time members(P,s)_i occurs, k ∈ alive[s]_i, s+1 ∈
ended-slots_i, and CUP-dec-val[k]_i is either ⊥ or larger than s. Assume by way of contradiction
that k ∉ P'; then either k ∉ alive[s]_j or CUP-dec-val[k]_j ≤ s when members(P',s)_j occurs.

Assume first that k ∉ alive[s]_j. Note that s ∈ ended-slots_j when members(P',s)_j
occurs, so by Invariant A.21, k ∈ suspects[s]_j at the time members(P',s)_j occurs. By the
precondition for members(P',s)_j, CUP-status[k]_j = done when it occurs, and by Invariant A.20,
CUP-dec-val[k]_j ≠ ⊥, that is, CUP(k).decide_j(v) occurred for some v and set CUP-dec-val[k]_j
= v. By the well-formedness property of CUP, j initiated CUP(k). Since k ∈ suspects[s]_j, k
cannot be included in suspects[s']_j for any s' ≠ s, and so j initiated CUP(k) with s. By the
validity condition of CUP, v ≤ s.

Since s+1 ∈ ended-slots_i when members(P,s)_i occurs, by Invariant A.21, k ∈ suspects[s+1]_i
at this time. Therefore, by the precondition for members(P,s)_i, CUP-status[k]_i = done. By
Invariant A.20, CUP-dec-val[k]_i ≠ ⊥, that is, CUP(k).decide_i occurred, and by the uniform
agreement property, CUP-dec-val[k]_i = CUP-dec-val[k]_j ≤ s. A contradiction.

Now, assume that CUP-dec-val[k]_j ≤ s when members(P',s)_j occurs. Since s+1 ∈ ended-slots_i
when members(P,s)_i occurs, by Invariant A.22, k ∈ suspects[s+1]_i at this time. Therefore, by
the precondition for members(P,s)_i, CUP-status[k]_i = done. By Invariant A.20, CUP-dec-val[k]_i ≠
⊥, that is, CUP(k).decide_i occurred, and by the uniform agreement property, CUP-dec-val[k]_i =
CUP-dec-val[k]_j ≤ s. A contradiction. ■
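The conditions on members(P, s) that this proof relies on can be collected into one function. This is a sketch assembled only from the facts cited in the proof (s+1 has ended, every process suspected in slot s or s+1 has a finished CUP instance, and P keeps the processes in alive[s] whose CUP decision is ⊥ or larger than s); the function name `members_set`, the dictionary layout, and `None` for ⊥ are assumptions, and the full precondition in the Atom code may contain further clauses.

```python
BOTTOM = None  # stands in for ⊥ (assumption of this sketch)

def members_set(s, alive, ended_slots, suspects, cup_status, cup_dec_val):
    """Return the set P for members(P, s), or None if the action is not enabled."""
    if s + 1 not in ended_slots:
        return None
    # Every process suspected in slot s or s+1 must have a finished CUP instance.
    pending = suspects.get(s, set()) | suspects.get(s + 1, set())
    if any(cup_status.get(k) != "done" for k in pending):
        return None
    # Keep processes heard from in slot s whose decided failure slot, if any,
    # is later than s.
    return {
        k for k in alive.get(s, set())
        if cup_dec_val.get(k, BOTTOM) is BOTTOM or cup_dec_val[k] > s
    }


alive = {3: {"a", "b", "c"}}
suspects = {4: {"c"}}
```

If CUP(c) decides 3, process c is excluded from the slot 3 membership; if it decides 4, c's slot 3 messages are still delivered; and while CUP(c) is still running, the action is not yet enabled.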

Lemma A.10 If members_i(P,s) occurs, then for every j ∈ P, i received a bulk message for slot s
from j prior to the members_i(P,s) action.

Proof: Assume members_i(P,s) occurs. Since j ∈ P, by the precondition for members_i(P,s), j ∈
alive[s]_i. By definition of alive[s], in-buf[s,j]_i ≠ ⊥, that is, i received a bulk message from
j for slot s. ■

A.3 Atom Correctness Proof: Liveness Arguments

In the liveness proof, we can use the safety guarantees of CUP, since they depend only on the safety 
assumptions about the environment. 

A.3.1 General liveness lemmas

Lemma A.11 Time passes, and current-slot_i increases through all slot values from zero onward, as
long as i does not fail.

Lemma A.12 If i does not leave or fail, then end_slot_i(s) occurs for every slot s ≥ join-slot_i.

Lemma A.13 If i leaves and does not fail, then eventually i multicasts an (i, LEAVE, s) message.

A.3.2 Liveness environment conditions for CUP

Init occurrence: Assume that an init_i(s, W) event occurs and j ∈ W, and neither i nor j leaves or
fails.

Since j ∈ W, j ∈ alive[s]_i ∪ joiners[s+1]_i at the time init_i(s, W) is triggered, which means
that net_join_OK_j had already occurred prior to the init_i(s, W) event. When init_i(s, W) is
triggered, i multicasts a (CUP-INIT, k) message. Since neither i nor j leaves or fails, j receives
this message.

Consider the pre-state value of CUP-status[k] when the (CUP-INIT, k) message from i arrives
at j. If CUP-status[k] is running or done, then either CUP(k).init_j or CUP(k).abstain_j had
to have already occurred and we are done. Otherwise, after this step CUP-status[k] = req, and
CUP-req-val[k] = v. Since j does not leave or fail, by Lemma A.12, it eventually has slots larger
than v in ended-slots_j, so either CUP(k).init_j(*) or CUP(k).abstain_j becomes enabled,
depending on whether k is suspected in some slot or in none.

Reliable delivery: Assume that for some processes j, k, l, CUP(k).init_l(v, W) occurs with j ∈ W,
and that either CUP(k).init_i(v', W') or CUP(k).abstain_i occurs. We will show that by the
time that either CUP(k).init_i(v', W') or CUP(k).abstain_i occurs, net_join_OK_j had already
occurred. This will imply that for any net_mcast_i(m) that occurs after this event, a net_rcv_j(m)
will occur unless either i fails, or j fails or leaves.

If CUP(k).init_i(v', W') occurs, then by Lemma A.8, net_join_OK_j occurs first. Now, consider
the case that CUP(k).abstain_i occurs. Process i can only abstain after it receives a
(CUP-INIT, k) message, which could have only been sent if some other process i' has already triggered
CUP(k).init_{i'}. By Lemma A.8, net_join_OK_j must have occurred before the CUP(k).init_{i'} event.

Complete leave and failure detector: If CUP(k).init_i(v, W) occurs with j ∈ W, then j ∈ alive[v]_i
∪ joiners[v+1]_i. Assume that i does not decide or leave or fail. Then CUP-status[k]_i remains
running from the time of the CUP(k).init_i(v, W) event onward. If leave_j occurs, j sends a LEAVE
message which i receives. When i receives j's LEAVE message, i triggers leave_detect_i(j) and
we are done. Otherwise, assume j does not leave and fail_j occurs. Then eventually there is a
slot for which i does not receive j's messages. Let s be the first such slot, so j ∉ alive[s]_i
while j ∈ alive[s-1]_i ∪ joiners[s]_i; since j does not leave in s, j ∈ suspects[s]_i. Since
j ∈ alive[v]_i ∪ joiners[v+1]_i, s > v, and i triggers fail_detect_i(j) while performing end_slot_i(s).

A.3.3 Liveness of Atom

Eventual join: Assume no fail_i occurs. When join_i occurs, net_join_i is triggered, and by
fairness, eventually occurs. By the eventual join property of Net, net_join_OK_i eventually occurs.
At that point, join-slot_i is set to be bigger than current-slot_i. join-slot_i does not change from
that point onward, since by the join integrity property of Net, no more net_join_OK_i events occur.
By Lemma A.11, current-slot_i eventually becomes equal to join-slot_i. When that happens,
join_OK_i becomes enabled, and remains enabled, as long as no time passes, until it occurs. By
our assumption on time passage, no time passes until join_OK_i occurs. Therefore, by fairness, it
eventually occurs.

Eventual leave: Assume no fail_i occurs. When leave_i occurs, leave-slot_i is set to be bigger
than current-slot_i. leave-slot_i does not change from that point onward, since by our
assumption on the application, no more leave_i events occur. By Lemma A.13, i eventually multicasts an
(i, LEAVE, leave-slot_i) message, at which point leave-slot_i is added to mcast-slots_i. When that
happens, net_leave_i becomes enabled and remains enabled until it occurs. Then, by the eventual
leave property of Net, net_leave_OK_i eventually occurs and triggers leave_OK_i.

Message delivery: The following lemma asserts that a process that participates in the algorithm
and does not leave or fail continues to perform members(P, s) forever.

Lemma A.14 If mcast_i(m) occurs for some m when s = current-slot_i, and no fail_i or leave_i
occurs, then for every s' ≥ s, members_i(P, s') occurs.

Proof: Since mcast_i(m) occurs, by our assumption about the application, it is preceded by a
join_OK_i. Therefore, m is appended to out-buf[s]_i. By Lemma A.11, current-slot_i becomes
s+1, and so i eventually sends its bulk message for slot s with m included in it. By liveness of Net,
net_rcv_i(i, m', s) occurs, where m' is i's slot s bulk message.

By Lemma A.12, end_slot_i(s) occurs for every slot s ≥ join-slot_i. The sets suspects[s]_i
and suspects[s+1]_i are set when end_slot_i(s) (resp. end_slot_i(s+1)) occurs, at which point
a CUP instance for each process in these sets is initiated, and the sets do not change afterwards.
By the termination property of CUP, these instances of CUP eventually terminate, setting the
corresponding CUP-status to done. Therefore, members(P, s)_i eventually becomes enabled for
some P, and by fairness, occurs. ■

We now prove that the message delivery liveness property holds. 

Assume mcast_i(m) occurs, and no fail_i or leave_i occurs. We first show that S contains m and
rcv_i(m) occurs. Let s = current-slot_i when mcast_i(m) occurs. By Lemma A.14, members(P,
s)_i occurs. We now show that i ∈ P. This will imply that m ∈ S (by definition of S), and that
rcv_i(m) occurs (since it is triggered by members(P, s)_i).

To show that i ∈ P, we have to show that i ∈ alive[s]_i and CUP-dec-val[i]_i = ⊥ at the
time members(P, s)_i occurs. Since at this time s+1 ∈ ended-slots_i, by Invariant A.16, i ∈
alive[s]_i. By Invariant A.18, since i does not fail it never becomes a suspect, and therefore, no
instance of CUP is run for i, and CUP-dec-val[i]_i = ⊥.

It remains to show that for every m' that follows m in S, rcv_i(m') also occurs. By definition
of S, m' is included in a bulk message for some slot s' ≥ s from some process i', such that i' ∈ P'
and members_j(P', s') occurs for some j. By Lemma A.14, members(P'', s')_i also occurs, and
by Lemma A.9, P' = P''. Therefore, rcv_i(m') is triggered by the members(P'', s')_i action.